06.07.2023
First experiences with DuckDB (and SQL in general after a while)
My first steps with DuckDB
Disclaimer
I never did hardcore SQL in my career. If you are a (PL/)SQL developer, please be kind while judging me for the following content ;)
Setup
Install the DuckDB command line client (https://duckdb.org/docs/installation/index). For the example data, I have taken a CSV export of my activities from Garmin Connect (https://connect.garmin.com/modern/activities). I took this one and not the GPX/TCX files because I did not want to jump directly into spatial or deeper analytics during my first steps. In case you don't have data stored on the Garmin page, you can take the content from the end of this post and store it in a file called Activities.csv. If you wonder why the CSV only contains a few entries: you need to scroll to the end of the page, or at least far enough that all the data you want ends up in the CSV. What an ugly user experience.
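The install page lists several options; on macOS, for example, Homebrew does the job:
brew install duckdb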
Hello DuckDB
In the folder with the CSV file, call duckdb to start the CLI and be ready to go.
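A detail worth knowing: started without arguments, duckdb runs on an in-memory database, so everything is gone once you quit. If you want the tables created below to persist, pass a database file name (garmin.duckdb is just an example):
duckdb garmin.duckdb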
First of all let's have a look at the structure of the CSV file.
DESCRIBE SELECT * FROM 'Activities.csv';
will give us an overview of the auto-detected types within the CSV file:
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Activity Type | VARCHAR | YES | |||
Date | TIMESTAMP | YES | |||
Favorite | BOOLEAN | YES | |||
Title | VARCHAR | YES | |||
Distance | VARCHAR | YES | |||
Calories | VARCHAR | YES | |||
Time | TIME | YES | |||
Avg HR | BIGINT | YES | |||
Max HR | BIGINT | YES | |||
Aerobic TE | DOUBLE | YES | |||
Avg Cadence | VARCHAR | YES | |||
Max Cadence | VARCHAR | YES | |||
Avg Pace | VARCHAR | YES | |||
Best Pace | VARCHAR | YES | |||
Total Ascent | VARCHAR | YES | |||
Total Descent | VARCHAR | YES | |||
Avg Stride Length | DOUBLE | YES | |||
Avg Vertical Ratio | DOUBLE | YES | |||
Avg Vertical Oscillation | DOUBLE | YES | |||
Avg Ground Contact Time | BIGINT | YES | |||
Training Stress Score® | DOUBLE | YES | |||
Avg Power | BIGINT | YES | |||
Max Power | BIGINT | YES | |||
Grit | DOUBLE | YES | |||
Flow | DOUBLE | YES | |||
Total Strokes | VARCHAR | YES | |||
Avg. Swolf | BIGINT | YES | |||
Avg Stroke Rate | BIGINT | YES | |||
Total Reps | BIGINT | YES | |||
Dive Time | VARCHAR | YES | |||
Min Temp | DOUBLE | YES | |||
Surface Interval | VARCHAR | YES | |||
Decompression | VARCHAR | YES | |||
Best Lap Time | VARCHAR | YES | |||
Number of Laps | BIGINT | YES | |||
Max Temp | DOUBLE | YES | |||
Avg Resp | VARCHAR | YES | |||
Min Resp | VARCHAR | YES | |||
Max Resp | VARCHAR | YES | |||
Moving Time | TIME | YES | |||
Elapsed Time | TIME | YES | |||
Min Elevation | VARCHAR | YES | |||
Max Elevation | VARCHAR | YES | |||
Looking at the result, the auto-detection of types does quite a good job.
But it is easy to spot that e.g. the elevation columns fall back to a varchar
type instead of being numeric.
This is a problem that will get solved later.
One piece of information that is missing for proper table entries is an identifier.
To be able to work with references for joins etc. later, we take this CSV file as the source for a new table
and add an id field generated by the uuid() function for each entry.
CREATE TABLE activity_import AS SELECT uuid() AS id, * FROM 'Activities.csv';
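A quick sanity check (my own addition, not strictly necessary) confirms that each row received its own id:
SELECT count(*) AS total_rows, count(DISTINCT id) AS distinct_ids FROM activity_import;
Both numbers must match, since uuid() produces a fresh value per row.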
Activity table
Since not every activity has all values present, the data will get imported into multiple tables. The first one is a kind of base data table, where Activity Type, Date and Title get stored. To derive the new table from the import table, the following statement is used:
CREATE TABLE activity AS SELECT id, "Activity Type", "Date", "Title" FROM activity_import;
Verifying that the expected data is in the table with
SELECT * FROM activity LIMIT 10;
produces
id | Activity Type | Date | Title |
---|---|---|---|
14cd31a9-868b-4d3d-a507-373850d90ac7 | Walking | 2023-07-04 08:01:08 | Brunswick Walking |
2e5480b4-4b2f-4dea-a340-515b04fa2083 | Walking | 2023-07-02 08:18:50 | Brunswick Walking |
19663dd5-d105-4471-91e6-69e85400ad5f | Walking | 2023-06-30 15:54:38 | Brunswick Walking |
596c3c8d-ae86-4953-bd7f-258d23ea2726 | Running | 2023-06-28 07:52:00 | Brunswick Running |
e71833bb-f4c3-4e4e-ac02-6d183d202e0d | Running | 2023-06-26 07:52:45 | Brunswick Running |
419acad8-2ca4-48a1-8f1a-9fc7c4263918 | Running | 2023-06-22 07:53:57 | Brunswick Running |
a6f9a970-80f4-4bc0-ad95-3e38a7b5c3e4 | Pool Swimming | 2023-06-17 11:39:30 | Pool Swimming |
7c687f12-492f-49d2-9647-68c063816e4f | Running | 2023-05-28 12:20:50 | Heiligenhafen Running |
b4ae30e8-d498-4448-866a-184b3f8245b4 | Running | 2023-05-27 07:49:10 | Heiligenhafen Running |
faedde1e-295d-4566-9d09-1358a9adfa1a | Walking | 2023-04-14 15:53:40 | Brunswick Walking |
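Following the same pattern, further metric groups could get their own tables. As a sketch (the table name activity_heart_rate is my choice, not something from the export):
CREATE TABLE activity_heart_rate AS SELECT id, "Avg HR", "Max HR" FROM activity_import;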
Elevation Comparison and Type Conversion
Now let's take on the varchar problem for the elevation columns that we saw in the beginning.
To figure out which values are non-numeric in the Min Elevation column, run:
SELECT "Min Elevation" FROM activity_import WHERE try_cast("Min Elevation" AS NUMERIC) IS NULL ORDER BY "Min Elevation";
That will uncover the "faulty" entries.
Min Elevation |
---|
-- |
-- |
-- |
-- |
-- |
-- |
-- |
1,246 |
1,246 |
1,269 |
1,312 |
1,318 |
1,321 |
1,353 |
1,412 |
1,448 |
1,449 |
1,449 |
1,458 |
1,475 |
1,476 |
1,476 |
1,476 |
1,476 |
1,476 |
1,477 |
1,478 |
1,478 |
1,478 |
1,478 |
1,478 |
1,483 |
1,494 |
1,524 |
There seem to be some empty fields indicated with --, and values containing a thousands separator cannot be parsed. At least we now have a pattern to filter and adjust the values before importing them into a table.
To convert the column into a numeric type, the pre-processing is done this way:
case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END
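Before baking the expression into a table, a minimal sketch against a few hand-picked literals (the VALUES list is mine, not data from the export) shows it behaves as intended:
SELECT v AS raw, case when v=='--' THEN 0 ELSE replace(v, ',', '')::numeric END AS converted
FROM (VALUES ('--'), ('1,246'), ('75')) AS t(v);
The -- placeholder becomes 0, and the comma-separated values parse cleanly.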
Now a table activity_elevation can be created by combining the expression above with the CREATE statement.
The conversion for the Max Elevation column gets added to the new table as well, because it is the very same as for the Min Elevation.
CREATE TABLE activity_elevation AS SELECT id,
case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END as "Min Elevation",
case when "Max Elevation"=='--' THEN 0 ELSE replace("Max Elevation", ',', '')::numeric END as "Max Elevation"
FROM activity_import;
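A quick check that the conversion took effect:
DESCRIBE activity_elevation;
Both elevation columns should now report a numeric (DECIMAL) type instead of VARCHAR.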
Querying
The question that I asked myself first after having a glance at the data was: "Which activity had the largest span between min and max elevation?" Of course, in the real world this does not cover activities with a lot of smaller ascents repeating over the track. The query
SELECT "Activity Type", "Date", "Max Elevation", "Min Elevation", "Max Elevation" - "Min Elevation" AS difference
FROM activity_elevation
JOIN activity USING (id)
ORDER BY difference DESC
LIMIT 10;
returns
Activity Type | Date | Max Elevation | Min Elevation | difference |
---|---|---|---|---|
Hiking | 2022-08-01 11:02:30 | 2356.000 | 1353.000 | 1003.000 |
Cycling | 2020-08-09 09:47:40 | 1924.000 | 1269.000 | 655.000 |
Cycling | 2022-08-05 13:23:41 | 1953.000 | 1412.000 | 541.000 |
Cycling | 2022-08-05 10:04:12 | 1684.000 | 1246.000 | 438.000 |
Walking | 2021-08-05 11:13:26 | 1361.000 | 938.000 | 423.000 |
Hiking | 2020-08-15 14:58:34 | 1940.000 | 1524.000 | 416.000 |
Cycling | 2020-09-04 09:24:56 | 538.000 | 137.000 | 401.000 |
Hiking | 2022-08-09 09:23:26 | 1783.000 | 1477.000 | 306.000 |
Hiking | 2022-08-07 09:35:59 | 1731.000 | 1448.000 | 283.000 |
Cycling | 2021-08-11 09:40:48 | 1524.000 | 1246.000 | 278.000 |
Those are my top 10 activities regarding elevation. And what surprises me the most: the join with the directive USING (id)
just works as expected by matching the id columns of both tables.
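For the record, USING (id) is shorthand for an explicit join condition, with the bonus that the two id columns get collapsed into a single one in the result. Spelled out, the equivalent (just more verbose) query would be:
SELECT a."Activity Type", a."Date", e."Max Elevation", e."Min Elevation",
e."Max Elevation" - e."Min Elevation" AS difference
FROM activity_elevation e
JOIN activity a ON a.id = e.id
ORDER BY difference DESC
LIMIT 10;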
Table-less
You might want to do everything in-place and just use DuckDB as a fancy CSV tool. All the steps above can also be combined into a single statement:
SELECT "Activity Type", "Date",
case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END as "MinElevation",
case when "Max Elevation"=='--' THEN 0 ELSE replace("Max Elevation", ',', '')::numeric END as "MaxElevation",
"MaxElevation" - "MinElevation" AS difference
from 'Activities.csv'
ORDER BY difference DESC
LIMIT 10;
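Note that difference reuses the MinElevation and MaxElevation aliases defined earlier in the same SELECT list. DuckDB allows this kind of alias reuse, which standard SQL does not.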
And here is the statement that I have used to create the example CSV file from my original export.
COPY (SELECT "Activity Type", "Min Elevation", "Max Elevation",
concat("Activity Type", ' on ', times.Date::varchar) as Title,
times.Date as Date
from 'Activities_original.csv' TABLESAMPLE 20 POSITIONAL JOIN
(SELECT generate_series as Date from generate_series(TIMESTAMP '2023-04-10 08:00:00', TIMESTAMP '2023-06-11', INTERVAL 36 HOUR) LIMIT 20) as times
)
TO 'Activities.csv' WITH (HEADER 1);
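Two bits of DuckDB-specific syntax in there: POSITIONAL JOIN pairs the rows of the two inputs purely by their position instead of a join condition, and TABLESAMPLE draws a sample from the original export.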
Activities.csv
Activity Type,Min Elevation,Max Elevation,Title,Date
Running,75,102,Running on 2023-04-10 08:00:00,2023-04-10 08:00:00
Running,"1,478","1,547",Running on 2023-04-11 20:00:00,2023-04-11 20:00:00
Cycling,65,94,Cycling on 2023-04-13 08:00:00,2023-04-13 08:00:00
Cycling,"1,246","1,684",Cycling on 2023-04-14 20:00:00,2023-04-14 20:00:00
Running,74,107,Running on 2023-04-16 08:00:00,2023-04-16 08:00:00
Hiking,"1,483","1,721",Hiking on 2023-04-17 20:00:00,2023-04-17 20:00:00
Running,-8,8,Running on 2023-04-19 08:00:00,2023-04-19 08:00:00
Walking,87,96,Walking on 2023-04-20 20:00:00,2023-04-20 20:00:00
Running,74,107,Running on 2023-04-22 08:00:00,2023-04-22 08:00:00
Running,74,107,Running on 2023-04-23 20:00:00,2023-04-23 20:00:00
Running,74,107,Running on 2023-04-25 08:00:00,2023-04-25 08:00:00
Running,75,102,Running on 2023-04-26 20:00:00,2023-04-26 20:00:00
Running,70,107,Running on 2023-04-28 08:00:00,2023-04-28 08:00:00
Gravel/Unpaved Cycling,76,93,Gravel/Unpaved Cycling on 2023-04-29 20:00:00,2023-04-29 20:00:00
Pool Swimming,--,--,Pool Swimming on 2023-05-01 08:00:00,2023-05-01 08:00:00
Cycling,137,538,Cycling on 2023-05-02 20:00:00,2023-05-02 20:00:00
Walking,964,"1,085",Walking on 2023-05-04 08:00:00,2023-05-04 08:00:00
Walking,75,102,Walking on 2023-05-05 20:00:00,2023-05-05 20:00:00
Cycling,77,112,Cycling on 2023-05-07 08:00:00,2023-05-07 08:00:00
Running,"1,449","1,545",Running on 2023-05-08 20:00:00,2023-05-08 20:00:00