
Data Warehouse and BigQuery

OLAP vs OLTP

OLTP stands for Online Transaction Processing, and OLAP stands for Online Analytical Processing.

OLTP systems serve backend transactional workloads, while OLAP systems are used by data analysts and data scientists to discover insights.

| | OLTP | OLAP |
|---|---|---|
| Purpose | Control and run essential business operations in real time | Plan, solve problems, support decisions, discover hidden insights |
| Data updates | Short, fast updates initiated by the user | Data periodically refreshed with scheduled, long-running batch jobs |
| Database design | Normalized databases for efficiency | Denormalized databases for analysis |
| Space requirements | Generally small if historical data is archived | Generally large due to aggregating large datasets |
| Backup and recovery | Regular backups | Lost data can be recovered from OLTP instead of regular backups |
| Productivity | Increases productivity of end users | Increases productivity of managers and data analysts |
| Data view | Lists day-to-day business transactions | Multi-dimensional view of enterprise data |
| User examples | Customer-facing personnel, clerks, online shoppers | Knowledge workers and executives |

A data warehouse is an OLAP solution used for reporting and data analysis. It consists of raw data, metadata and summaries, and typically draws on many data sources.


A data warehouse can feed data marts, but can also serve its raw data directly. A data mart is a focused, smaller database containing a subset of data from a larger data warehouse, designed for a specific department (like Sales or Marketing) or business function, providing faster, easier access for targeted analysis, reporting and business intelligence.

BigQuery

BigQuery is a serverless data warehouse. It provides both software and infrastructure, with scalability and availability in mind.

With BigQuery you can do ML via SQL, handle geospatial data, and build business intelligence solutions.
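As a small illustration of the geospatial support, BigQuery's GIS functions operate on geographic data directly in SQL. A minimal sketch (the coordinates below are arbitrary illustration values, not from the source):

```sql
-- Sketch: distance in meters between two points (longitude, latitude)
-- Coordinates are arbitrary illustration values
SELECT ST_DISTANCE(
  ST_GEOGPOINT(-74.0060, 40.7128),
  ST_GEOGPOINT(-73.9857, 40.7484)
) AS distance_in_meters;
```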

BigQuery is flexible in how it handles data. BQ separates the compute engine that analyzes the data, from storage.

BQ has two pricing models:

- On-demand pricing: you pay for the amount of data your queries process.
- Flat-rate pricing: you pay for a fixed amount of pre-allocated compute slots, which suits large, predictable workloads.

BigQuery SQL Table Creation

-- Query a publicly available table
SELECT station_id, name
FROM `bigquery-public-data.new_york_citibike.citibike_stations`
LIMIT 100;
-- Create an external table referring to a GCS path
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.external_yellow_tripdata`
OPTIONS (
  format = 'CSV',
  uris = ['gs://nyc-tl-data/trip data/yellow_tripdata_2019-*.csv', 'gs://nyc-tl-data/trip data/yellow_tripdata_2020-*.csv']
);

When creating an external table (a table whose data lives in an external source such as GCS rather than in BigQuery storage), BigQuery is not able to determine the table's size or number of rows, since the data is never ingested.

Partitioning


Generally, when we create a dataset, we have columns whose values repeat. Partitioning can improve BigQuery performance by splitting the raw dataset into "buckets" (partitions) based on a column's value, such as a date, improving cost and speed because less data is processed at query time.

A partitioned table is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance and control costs by reducing the number of bytes read by a query. You partition tables by specifying a partition column which is used to segment the table.

-- Check yellow trip data
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata` LIMIT 10;

-- Create a non-partitioned table from the external table
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_non_partitioned` AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;

-- Create a partitioned table from the external table
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
PARTITION BY
  DATE(tpep_pickup_datetime) AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;

-- Impact of partitioning
-- Scans 1.6 GB of data
SELECT DISTINCT(VendorID)
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_non_partitioned`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Scans ~106 MB of data
SELECT DISTINCT(VendorID)
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Let's look into the partitions
SELECT table_name, partition_id, total_rows
FROM `nytaxi.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'yellow_tripdata_partitioned'
ORDER BY total_rows DESC;

Clustering


Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query performance and reduce query costs.

In BigQuery, a clustered column is a user-defined table property that sorts storage blocks based on the values in the clustered columns. The storage blocks are adaptively sized based on the size of the table.

When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.

-- Create a partitioned and clustered table
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;

-- Query scans 1.1 GB
SELECT count(*) AS trips
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID = 1;

-- Query scans 864.5 MB
SELECT count(*) AS trips
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID = 1;

Clustering vs Partitioning

When creating a partitioned table, you can choose to partition by a time-unit column, by ingestion time, or by integer range. A table can have at most 4,000 partitions.

When choosing time-unit or ingestion-time partitioning, you can pick a granularity of day (the default), hour, month or year.
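As a sketch of the DDL involved (the table names reuse the nytaxi examples from this document and are assumptions), monthly time-unit partitioning and integer-range partitioning might look like:

```sql
-- Sketch: partition by month instead of the default day granularity
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_monthly`
PARTITION BY TIMESTAMP_TRUNC(tpep_pickup_datetime, MONTH) AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;

-- Sketch: integer-range partitioning on VendorID
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_by_vendor`
PARTITION BY RANGE_BUCKET(VendorID, GENERATE_ARRAY(1, 10, 1)) AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
```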

When clustering, the columns you specify are used to co-locate related data; the order of the columns is important because it determines the sort order of the data. You can specify up to 4 clustering columns, which must be top-level, non-repeating columns.

Clustering improves filter and aggregate queries.

It makes sense to cluster or partition only for tables larger than about 1 GB; for smaller tables, the overhead these features add can outweigh their benefits.
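A sketch of clustering on multiple columns (the table reuses the nytaxi examples from this document; the particular column choice is an assumption):

```sql
-- Sketch: cluster by up to four columns; their order sets the sort order
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_multi_clustered`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID, payment_type AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
```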

| Clustering | Partitioning |
|---|---|
| Cost benefit unknown | Cost known upfront |
| You need more granularity than partitioning allows | You need partition-level management |
| Your queries commonly use filters or aggregation against multiple particular columns | Filter or aggregate on a single column |
| The cardinality of the number of values in a column is large | |

You would cluster instead of partition if:

- Partitioning results in a small amount of data per partition (approximately less than 1 GB)
- Partitioning would exceed the limit of 4,000 partitions
- Partitioning means your mutation operations modify the majority of partitions frequently (e.g. every few minutes)

Automatic Reclustering

As data is added to a clustered table:

- Newly inserted data can be written to blocks whose key ranges overlap with the key ranges of previously written blocks
- These overlapping keys weaken the sort property of the table

To maintain the performance characteristics of a clustered table:

- BigQuery performs automatic re-clustering in the background to restore the sort property of the table
- For partitioned tables, clustering is maintained within the scope of each partition

BigQuery Best Practices

  1. Cost reduction:
     - Avoid SELECT *; query only the columns you need
     - Price your queries before running them
     - Use clustered or partitioned tables
     - Use streaming inserts with caution
     - Materialize query results in stages
  2. Query performance:
     - Filter on partitioned or clustered columns
     - Denormalize data
     - Reduce data before using a JOIN
     - Do not treat WITH clauses as prepared statements
     - Avoid oversharding tables
     - In joins, place the table with the largest number of rows first
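As a small illustration of the "avoid SELECT *" advice (the table name reuses the nytaxi examples from this document), selecting only the named columns scans far fewer bytes than selecting everything, because BigQuery's columnar storage reads only the referenced columns:

```sql
-- Scans only the named columns instead of the whole table
SELECT VendorID, tpep_pickup_datetime, total_amount
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE DATE(tpep_pickup_datetime) = '2019-06-01';
```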

Internals of BigQuery


BigQuery stores data in a separate storage system called Colossus, a cheap storage layer that keeps data in a columnar format. Since storage is separated from compute, storage costs are significantly lower; the most cost-intensive task is reading and processing the data, which is mostly compute.

Since compute and storage run on different hardware, how do they communicate? A slow network would hurt query speed. That's where the Jupiter network plays a role: it is the network inside Google's datacenters, providing about 1 TB/s of bandwidth, which allows compute and storage to live on separate hardware while talking without noticeable delay.

The third component of BigQuery is Dremel, its query execution engine. Dremel organizes each query into a tree structure, splitting it so that each node of the tree executes an individual subset of the query.

Columnar and Record-oriented storage


Record-oriented (row-oriented) storage is the typical layout found in formats like CSV: each record (row) is its own entity, separated by a delimiter (a newline in CSV).

Column-oriented storage organizes data by column, storing all values for a single column together on disk, unlike traditional row-oriented systems that store all data for a single record sequentially. This method significantly speeds up analytical queries (like sums, averages, min/max) by allowing systems to read only the relevant columns, reducing I/O. It also enables efficient compression, because data within a column is all of the same type, making it ideal for data warehousing and big data analytics.


Dremel


Dremel modifies the query it receives into a number of subqueries that are delegated to mixers, which further divide their own queries into subqueries until this process can no longer continue; the final subqueries are handed to leaf nodes. These leaf nodes are the workers that actually interface with Colossus, retrieving the data in parallel (MapReduce style), processing it, and delivering results back up to the parent nodes until the original query is satisfied.

BigQuery ML
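BigQuery ML lets you create and run machine learning models using standard SQL, without exporting data. A minimal sketch of training and using a linear regression model (the dataset, model name and column choices below reuse the nytaxi examples from this document and are assumptions):

```sql
-- Sketch: train a linear regression model predicting tip_amount
-- (table, model name and feature columns are assumptions)
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['tip_amount']
) AS
SELECT passenger_count, trip_distance, fare_amount, tip_amount
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE tip_amount IS NOT NULL;

-- Use the trained model for predictions
SELECT *
FROM ML.PREDICT(
  MODEL `taxi-rides-ny.nytaxi.tip_model`,
  (SELECT passenger_count, trip_distance, fare_amount
   FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
   LIMIT 10));
```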

#data-engineering #study-plan #career-development #zoomcamp #data-marts