
Docker and Postgres

A data pipeline is a processing service that takes data in and produces more data as output.
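
As a minimal sketch (the file names and columns here are hypothetical), a pipeline can be as small as a script that reads a file, transforms it, and writes the result somewhere else:

import pandas as pd


def run_pipeline(input_path: str, output_path: str) -> None:
    # ingest: read the raw data
    df = pd.read_csv(input_path)

    # transform: parse a date column and drop invalid rows (hypothetical columns)
    df["pickup_date"] = pd.to_datetime(df["pickup_date"])
    df = df[df["amount"] > 0]

    # load: write the processed data
    df.to_csv(output_path, index=False)


if __name__ == "__main__":
    run_pipeline("raw_trips.csv", "clean_trips.csv")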

Docker is a software platform that allows you to build, test, and deploy applications quickly. Docker packages software into standardized units called containers that have everything the software needs to run including libraries, system tools, code, and runtime. Using Docker, you can quickly deploy and scale applications into any environment and know your code will run.

We can have multiple databases and pipelines running in separate docker containers, completely separated from each other.

Docker images are snapshots of a container; they contain all the instructions and files needed to recreate a specific environment. This guarantees reproducibility anywhere.

Data Engineers care about Docker because:

- Reproducibility: an image guarantees the same environment anywhere it runs.
- Isolation: databases and pipelines can run in separate containers without interfering with each other.
- Portability: a containerized pipeline can be deployed and scaled in any environment (local machine, CI, cloud).

Basics

We can pull images from Docker Hub and run them directly in our machine:

docker run -it ubuntu will pull and create an interactive ubuntu container running bash

docker run -it python:3.9 will pull an image with Python 3.9 and run the Python interpreter.

If we want to create the python:3.9 container, but starting it with bash instead of with Python (maybe because we want to install some libraries like pandas), we can change the entrypoint:

docker run -it --entrypoint=bash python:3.9

Dockerfile

A Dockerfile is a list of instructions describing how to build a Docker image and how containers created from it should run. For example, we can translate the command docker run -it --entrypoint=bash python:3.9 into

FROM python:3.9

RUN pip install pandas

ENTRYPOINT ["bash"]

and then docker build -t test:pandas ., which builds an image with the tag test:pandas using the Dockerfile in the current directory. This image contains Python 3.9 and pandas, and when we run docker run -it test:pandas, the container starts in interactive mode with bash.

Dockerize Scripts

Let's assume we have a script pipeline.py that we want to run from a container with the required dependencies (e.g. pandas); something as simple as

import pandas as pd
import sys

day = sys.argv[1]
print(f"Job finished successfully for day = {day}.")

then we can write the following Dockerfile:

FROM python:3.9

RUN pip install pandas

WORKDIR /app

COPY pipeline.py pipeline.py

ENTRYPOINT [ "python", "pipeline.py" ]

We can build like before, and when running something like docker run test:pandas 2023-01-01, it will print Job finished successfully for day = 2023-01-01.

Postgres in Docker

We can run a Postgres database in Docker with the following command. It pulls the postgres:13 image, sets environment variables for the username, password and database name, and maps the port inside the container (5432) to a port on the host (5433). It also mounts a host folder (ny_taxi_postgres_data) into the container's data directory, so the data persists when the container is taken down and recreated.

docker run -it -e POSTGRES_USER=root \
               -e POSTGRES_PASSWORD=root \
               -e POSTGRES_DB=ny_taxi \
               -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
               -p 5433:5432 postgres:13

Once the Postgres database is up and running, we can query it with a CLI tool called pgcli; just run

pgcli --host localhost --port 5433 --user root --dbname ny_taxi

enter the password (root) when prompted, and then run queries in the interactive CLI with autocompletion.

Ingesting Data

Let's assume we have a CSV dataset that we want to ingest into Postgres. We can do this in Python; we first need to install a few libraries: pandas, sqlalchemy and psycopg2-binary (the Postgres driver).

We can inspect the first 100 rows of a dataset with

import pandas as pd

df = pd.read_csv("./yellow_taxi_data.csv.gz", nrows=100)
df.head(10)

If we want to ingest the data into Postgres, we need a SQL schema (a CREATE TABLE statement) for the table. pandas can generate one with

print(pd.io.sql.get_schema(df, name='yellow_taxi_trips'))

# CREATE TABLE "yellow_taxi_trips" (
#   "VendorID" INTEGER,
#   "tpep_pickup_datetime" TEXT,
#   "tpep_dropoff_datetime" TEXT,
#   "passenger_count" INTEGER,
#   "trip_distance" REAL,
#   "RatecodeID" INTEGER,
#   "store_and_fwd_flag" TEXT,
#   "PULocationID" INTEGER,
#   "DOLocationID" INTEGER,
#   "payment_type" INTEGER,
#   "fare_amount" REAL,
#   "extra" REAL,
#   "mta_tax" REAL,
#   "tip_amount" REAL,
#   "tolls_amount" REAL,
#   "improvement_surcharge" REAL,
#   "total_amount" REAL,
#   "congestion_surcharge" REAL
# )

We can see that some datetime fields are read as text. We can convert that with

1df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
2df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])

and now

print(pd.io.sql.get_schema(df, name='yellow_taxi_trips'))

# CREATE TABLE "yellow_taxi_trips" (
#   "VendorID" INTEGER,
#   "tpep_pickup_datetime" TIMESTAMP,
#   "tpep_dropoff_datetime" TIMESTAMP,
#   "passenger_count" INTEGER,
#   "trip_distance" REAL,
#   "RatecodeID" INTEGER,
#   "store_and_fwd_flag" TEXT,
#   "PULocationID" INTEGER,
#   "DOLocationID" INTEGER,
#   "payment_type" INTEGER,
#   "fare_amount" REAL,
#   "extra" REAL,
#   "mta_tax" REAL,
#   "tip_amount" REAL,
#   "tolls_amount" REAL,
#   "improvement_surcharge" REAL,
#   "total_amount" REAL,
#   "congestion_surcharge" REAL
# )

However, we want the schema in the PostgreSQL dialect. We can get that by telling pandas which database we are working with; to do that, we need to connect to the database first:

from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5433/ny_taxi")
con = engine.connect()

print(pd.io.sql.get_schema(df, name='yellow_taxi_trips', con=con))

# CREATE TABLE yellow_taxi_trips (
#         "VendorID" BIGINT,
#         tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE,
#         tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE,
#         passenger_count BIGINT,
#         trip_distance FLOAT(53),
#         "RatecodeID" BIGINT,
#         store_and_fwd_flag TEXT,
#         "PULocationID" BIGINT,
#         "DOLocationID" BIGINT,
#         payment_type BIGINT,
#         fare_amount FLOAT(53),
#         extra FLOAT(53),
#         mta_tax FLOAT(53),
#         tip_amount FLOAT(53),
#         tolls_amount FLOAT(53),
#         improvement_surcharge FLOAT(53),
#         total_amount FLOAT(53),
#         congestion_surcharge FLOAT(53)
# )

Now that we have everything set up, we can read the whole dataset in chunks (so that we don't load it all into memory and risk crashing the program) and load each chunk into Postgres.

from tqdm import tqdm

# now let's read the whole dataset in batches
df_iter = pd.read_csv("./yellow_taxi_data.csv.gz", iterator=True, chunksize=100_000)

# first chunk; drop the first column with .iloc[:, 1:]
df_chunk: pd.DataFrame = next(df_iter).iloc[:, 1:]
df_chunk['tpep_pickup_datetime'] = pd.to_datetime(
    df_chunk['tpep_pickup_datetime'])
df_chunk['tpep_dropoff_datetime'] = pd.to_datetime(
    df_chunk['tpep_dropoff_datetime'])

# create empty table first
df_chunk.head(0).to_sql('yellow_taxi_trips', con=con, if_exists='replace')

# now append rows
df_chunk.to_sql('yellow_taxi_trips', con=con, if_exists='append')


for chunk in tqdm(df_iter):
    df_chunk = chunk.iloc[:, 1:]
    df_chunk['tpep_pickup_datetime'] = pd.to_datetime(
        df_chunk['tpep_pickup_datetime'])
    df_chunk['tpep_dropoff_datetime'] = pd.to_datetime(
        df_chunk['tpep_dropoff_datetime'])
    df_chunk.to_sql('yellow_taxi_trips', con=con, if_exists='append')

con.close()
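
As a quick sanity check (a minimal sketch, assuming the same connection settings as above), we can count how many rows actually landed in the table:

import pandas as pd
from sqlalchemy import create_engine

# reconnect, since the connection used for the ingestion was closed
engine = create_engine("postgresql://root:root@localhost:5433/ny_taxi")
print(pd.read_sql("SELECT COUNT(1) AS row_count FROM yellow_taxi_trips", con=engine))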

Let's recreate the script in a more flexible way; we can use the argparse library to parse command-line arguments, so that values like username, password and database name can be provided at runtime:

import pandas as pd
from sqlalchemy import create_engine
import argparse
import os
import time


def ingest_data(user: str,
                password: str,
                dbname: str,
                tablename: str,
                url: str,
                host: str = "localhost",
                port: str = "5432") -> None:

    csv_name = url.split("/")[-1]
    os.system(f"wget {url} -O {csv_name}")

    engine = create_engine(
        f'postgresql://{user}:{password}@{host}:{port}/{dbname}')

    with engine.connect() as con:
        df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100_000)

        df_chunk: pd.DataFrame = next(df_iter)
        df_chunk['tpep_pickup_datetime'] = pd.to_datetime(
            df_chunk['tpep_pickup_datetime'])
        df_chunk['tpep_dropoff_datetime'] = pd.to_datetime(
            df_chunk['tpep_dropoff_datetime'])

        df_chunk.head(0).to_sql(tablename,
                                con=con, if_exists='replace')

        start = time.time()
        df_chunk.to_sql(tablename, con=con, if_exists='append')
        end = time.time()
        print(f"Inserted chunk in {end-start} seconds")

        for df_chunk in df_iter:
            df_chunk['tpep_pickup_datetime'] = pd.to_datetime(
                df_chunk['tpep_pickup_datetime'])
            df_chunk['tpep_dropoff_datetime'] = pd.to_datetime(
                df_chunk['tpep_dropoff_datetime'])
            start = time.time()
            df_chunk.to_sql(tablename, con=con, if_exists='append')
            end = time.time()
            print(f"Inserted chunk in {end-start} seconds")


def main():
    parser = argparse.ArgumentParser(
        description="Ingest CSV data into Postgres")
    parser.add_argument("--user", required=True)
    parser.add_argument("--password", required=True)
    parser.add_argument("--dbname", required=True)
    parser.add_argument("--tablename", required=True)
    parser.add_argument("--url", required=True)
    parser.add_argument("--host", required=False, default="localhost")
    parser.add_argument("--port", required=False, default="5432")

    args = parser.parse_args()

    user = args.user
    password = args.password
    dbname = args.dbname
    tablename = args.tablename
    url = args.url
    host = args.host
    port = args.port

    ingest_data(user, password, dbname, tablename, url, host, port)


if __name__ == "__main__":
    main()

Now we can run this script in our environment with:

python ingest_data.py --user root --password root --dbname ny_taxi --tablename yellow_taxi_trips --host localhost --port 5433 --url https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz

Similarly to what we did for our pipeline.py script, we can dockerize this script so that it is runnable everywhere through docker. Let’s create a Dockerfile that we will use to build an image:

FROM python:3.9

RUN pip install pandas sqlalchemy psycopg2-binary
RUN apt-get update && apt-get install -y wget

WORKDIR /app
COPY ingest_data.py ingest_data.py

ENTRYPOINT ["python", "ingest_data.py"]

Let's build it:

docker build -t ingest:v001 .

However, if we try to run it with

docker run -it ingest:v001 --user root --password root --dbname ny_taxi --tablename yellow_taxi_trips --host localhost --port 5433 --url https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz

# sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (127.0.0.1), port 5433 failed: Connection refused
#        Is the server running on that host and accepting TCP/IP connections?

it fails. This is because Docker containers run on their own isolated networks. That is also why we had to map the container port to a host port when creating the Postgres container: to expose it to the host machine.

The Postgres container is running in an isolated Docker network, and its port 5432 has been mapped to port 5433 on the host. The ingest_data container runs in its own isolated network and can reach neither the host machine's ports nor the Postgres container's ports. To make the two containers communicate, we need to put them on the same network and give them names, so that Docker's internal DNS can resolve each name to the container's internal IP address.
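
To make this concrete, here is a minimal sketch (with the credentials used above, and the container name pgdatabase that we assign below) of how the connection URL changes depending on where the code runs:

from sqlalchemy import create_engine

# from the host machine: connect to the published port 5433 on localhost
engine_from_host = create_engine("postgresql://root:root@localhost:5433/ny_taxi")

# from a container attached to the same Docker network as the database:
# use the database container's name as the host and the internal Postgres port 5432
engine_from_container = create_engine("postgresql://root:root@pgdatabase:5432/ny_taxi")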

Docker Networks

Docker networks are like virtual LANs for your containers.

When you run containers, they need to talk to each other and to the outside world. Docker networks are the plumbing that makes this happen. Think of it like this: you have multiple apps running in separate containers (a web server, a database, a cache). They’re isolated from each other by default. A Docker network is what lets you say “these containers can talk to each other, but that one over there can’t.”

Here's what matters:

Network Drivers

Docker ships with several network drivers; the ones you will meet most often are:

- bridge: the default driver; containers on the same bridge network can reach each other, and ports can be published to the host.
- host: the container shares the host's network stack directly, with no isolation and no port mapping.
- none: networking is disabled for the container.
- overlay: connects containers running on different Docker hosts (used with Swarm).
- macvlan: gives the container its own MAC address so it appears as a physical device on the network.

Key Concepts

From the Docker documentation:

Container networking refers to the ability for containers to connect to and communicate with each other, and with non-Docker network services.

Containers have networking enabled by default, and they can make outgoing connections. A container has no information about what kind of network it’s attached to, or whether its network peers are also Docker containers. A container only sees a network interface with an IP address, a gateway, a routing table, DNS services, and other networking details.

When Docker Engine on Linux starts for the first time, it has a single built-in network called the "default bridge" network. When you run a container without the --network option, it is connected to the default bridge.

Containers attached to the default bridge have access to network services outside the Docker host. They use “masquerading” which means, if the Docker host has Internet access, no additional configuration is needed for the container to have Internet access.

With the default configuration, containers attached to the default bridge network have unrestricted network access to each other using container IP addresses. They cannot refer to each other by name.

It can be useful to separate groups of containers that should have full access to each other, but restricted access to containers in other groups.

You can create custom, user-defined networks, and connect groups of containers to the same network. Once connected to a user-defined network, containers can communicate with each other using container IP addresses or container names.

Connect postgres container to script container

With this out of the way, we can make the two containers communicate by putting them on the same user-defined network. Let's create a new Docker network:

docker network create pg-network

Now let’s rerun the postgres container specifying the network, and the container name:

docker run -it -e POSTGRES_USER=root \
               -e POSTGRES_PASSWORD=root \
               -e POSTGRES_DB=ny_taxi \
               -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
               --network pg-network \
               --name pgdatabase \
               -p 5433:5432 \
               postgres:13

Now we can run our dockerized ingest script, with a couple of changes: the container is attached to the pg-network network; the database host name is now the Postgres container's name, pgdatabase, which Docker's internal DNS resolves to that container's IP address; and the port is the actual Postgres port inside the container (5432), not the one exposed to the host machine:

docker run -it --network pg-network ingest:v001 --user root --password root --dbname ny_taxi --tablename yellow_taxi_trips --host pgdatabase --port 5432 --url https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz

# [...]
# yellow_tripdata_2 100%[=============>]  23.87M  54.0MB/s    in 0.4s
#
# 2025-11-04 11:10:02 (54.0 MB/s) - ‘yellow_tripdata_2021-01.csv.gz’ saved [25031880/25031880]
#
# Inserted chunk in 10.459648609161377 seconds
# Inserted chunk in 9.797333717346191 seconds
# Inserted chunk in 9.827284336090088 seconds
# Inserted chunk in 9.83919906616211 seconds
# Inserted chunk in 10.140096187591553 seconds
# Inserted chunk in 9.996657848358154 seconds
# Inserted chunk in 10.016377210617065 seconds
# Inserted chunk in 10.257993698120117 seconds
# Inserted chunk in 9.91151237487793 seconds
# Inserted chunk in 9.781015634536743 seconds
# Inserted chunk in 9.51760482788086 seconds
# Inserted chunk in 10.054514646530151 seconds
# /app/ingest_data.py:39: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
#   for df_chunk in df_iter:
# Inserted chunk in 10.027701377868652 seconds
# Inserted chunk in 6.70109748840332 seconds

PgAdmin4

Let's connect a user interface to our database. PgAdmin4 is a web app that provides a UI for Postgres. It has a Docker image, so we can simply spin up a container and connect it to our Postgres database. Since the database is running inside the pg-network network, we need to run the pgadmin container inside that network as well:

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg-network \
  --name pgadmin-2 \
  dpage/pgadmin4

Now we can go to localhost:8080 and be greeted by the login page. Insert the credentials, then right-click Servers in the browser tree on the left and choose Register > Server. Give the server a name (Docker Postgres, for example); under the Connection tab, use the Postgres container name (pgdatabase) as the host name and enter the username and password. Click Save, and now you can navigate the ny_taxi database and query the yellow_taxi_trips table.

Docker Compose

Docker Compose is a tool that lets us define, create and run multiple services (containers) from a single YAML file. Services defined in the same Compose file also share a network. Let's define the Postgres database and the pgadmin container in a docker-compose file:

services:
  db:
    container_name: pgdatabase
    image: postgres:13
    environment:
      POSTGRES_USER: 'root'
      POSTGRES_PASSWORD: 'root'
      POSTGRES_DB: 'ny_taxi'
    ports:
      - '5433:5432'
    volumes:
      - vol-pgdata:/var/lib/postgresql/data

  pgadmin:
    container_name: pgadmin
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: "admin@admin.com"
      PGADMIN_DEFAULT_PASSWORD: "root"
    ports:
      - "8080:80"
    volumes:
      - vol-pgadmin_data:/var/lib/pgadmin

volumes:
  vol-pgdata:
    name: vol-pgdata
  vol-pgadmin_data:
    name: vol-pgadmin_data

We can create and start the services in the background (detached mode) with

docker compose up -d

and tear them down with

docker compose down

This file also declares managed (named) volumes in the volumes section. These are similar to the bind mount we used with the -v flag when creating the Postgres container, but here the storage on the host is managed entirely by Docker: you don't need to decide where on the host the data lives. Docker creates a volume, gives it the name you specified, and you can reference it by that name later. Named volumes can be listed and inspected with the docker volume command (e.g. docker volume ls, docker volume inspect vol-pgdata).

SQL Refresher

Now that we have our postgres database up and running, and a UI with PgAdmin, we can run queries against our yellow_taxi_trips table. Let’s also ingest the taxi_zone_lookup table, and call it zones.
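
A minimal sketch of how the zones table could be ingested (assuming taxi_zone_lookup.csv has already been downloaded to the working directory, and the same connection settings as before):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5433/ny_taxi")

# the zones lookup table is small, so no chunking is needed
df_zones = pd.read_csv("taxi_zone_lookup.csv")
df_zones.to_sql("zones", con=engine, if_exists="replace")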

Let’s look at some SQL queries we can make:

Joining Yellow Taxi table with Zones Lookup table (implicit INNER JOIN)

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t,
    zones zpu,
    zones zdo
WHERE
    t."PULocationID" = zpu."LocationID"
    AND t."DOLocationID" = zdo."LocationID"
LIMIT 100;

Joining Yellow Taxi table with Zones Lookup table (Explicit INNER JOIN)

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
JOIN
-- or INNER JOIN; PostgreSQL treats a plain JOIN as an INNER JOIN
    zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN
    zones zdo ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;

Checking for records with NULL Location IDs in the Yellow Taxi table

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_trips
WHERE
    "PULocationID" IS NULL
    OR "DOLocationID" IS NULL
LIMIT 100;

Checking for Location IDs in the Zones table NOT IN the Yellow Taxi table

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_trips
WHERE
    "DOLocationID" NOT IN (SELECT "LocationID" FROM zones)
    OR "PULocationID" NOT IN (SELECT "LocationID" FROM zones)
LIMIT 100;

Using LEFT, RIGHT, and OUTER JOINs when some Location IDs are missing from one of the tables

-- remove one zone so that some trips now reference a LocationID that is missing from zones
DELETE FROM zones WHERE "LocationID" = 142;

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
LEFT JOIN
    zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN
    zones zdo ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
RIGHT JOIN
    zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN
    zones zdo ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
FULL OUTER JOIN
    zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN
    zones zdo ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;

Using GROUP BY to calculate number of trips per day

SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1)
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
LIMIT 100;

Using ORDER BY to order the results of your query

-- Ordering by day

SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1)
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
ORDER BY
    "day" ASC
LIMIT 100;

-- Ordering by count

SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1) AS "count"
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
ORDER BY
    "count" DESC
LIMIT 100;

Other kinds of aggregations

SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1) AS "count",
    MAX(total_amount) AS "total_amount",
    MAX(passenger_count) AS "passenger_count"
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
ORDER BY
    "count" DESC
LIMIT 100;

Grouping by multiple fields

SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    "DOLocationID",
    COUNT(1) AS "count",
    MAX(total_amount) AS "total_amount",
    MAX(passenger_count) AS "passenger_count"
FROM
    yellow_taxi_trips
GROUP BY
    1, 2  -- positional references to the first and second columns in the SELECT list ("day" and "DOLocationID")
ORDER BY
    "day" ASC,
    "DOLocationID" ASC
LIMIT 100;

#data-engineering #study-plan #career-development #zoomcamp