Gerrit's Blog

06.07.2023

First experiences with DuckDB (and SQL in general after a while)

My first steps with DuckDB

Disclaimer

I never did hardcore SQL in my career. If you are a (PL/)SQL developer, please be kind while judging me for the following content ;)

Setup

Install the DuckDB command line client: https://duckdb.org/docs/installation/index. For the example data, I have taken a CSV export of my Garmin activities from https://connect.garmin.com/modern/activities. I chose this export and not GPX/TCX files because I did not want to jump directly into spatial or deeper analytics during my first steps. In case you don't have data stored on the Garmin page, you can take the content from the end of this post and store it in a file called Activities.csv. If you wonder why the exported CSV shows just a few entries: you need to scroll to the end of the page, or at least far enough that all the data you want ends up in the CSV, what an ugly user experience.
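Once the CLI is running (see the next section), a quick row count is an easy way to verify that the export is readable, a minimal check assuming the file is named Activities.csv as described:

SELECT count(*) FROM 'Activities.csv';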

Hello DuckDB

In the folder with the CSV file, call duckdb to start the CLI and you are ready to go. First of all, let's have a look at the structure of the CSV file.

DESCRIBE SELECT * FROM 'Activities.csv';

will give us an overview of the auto-detected types within the CSV file:

column_name               column_type  null  key  default  extra
Activity Type             VARCHAR      YES
Date                      TIMESTAMP    YES
Favorite                  BOOLEAN      YES
Title                     VARCHAR      YES
Distance                  VARCHAR      YES
Calories                  VARCHAR      YES
Time                      TIME         YES
Avg HR                    BIGINT       YES
Max HR                    BIGINT       YES
Aerobic TE                DOUBLE       YES
Avg Cadence               VARCHAR      YES
Max Cadence               VARCHAR      YES
Avg Pace                  VARCHAR      YES
Best Pace                 VARCHAR      YES
Total Ascent              VARCHAR      YES
Total Descent             VARCHAR      YES
Avg Stride Length         DOUBLE       YES
Avg Vertical Ratio        DOUBLE       YES
Avg Vertical Oscillation  DOUBLE       YES
Avg Ground Contact Time   BIGINT       YES
Training Stress Score®    DOUBLE       YES
Avg Power                 BIGINT       YES
Max Power                 BIGINT       YES
Grit                      DOUBLE       YES
Flow                      DOUBLE       YES
Total Strokes             VARCHAR      YES
Avg. Swolf                BIGINT       YES
Avg Stroke Rate           BIGINT       YES
Total Reps                BIGINT       YES
Dive Time                 VARCHAR      YES
Min Temp                  DOUBLE       YES
Surface Interval          VARCHAR      YES
Decompression             VARCHAR      YES
Best Lap Time             VARCHAR      YES
Number of Laps            BIGINT       YES
Max Temp                  DOUBLE       YES
Avg Resp                  VARCHAR      YES
Min Resp                  VARCHAR      YES
Max Resp                  VARCHAR      YES
Moving Time               TIME         YES
Elapsed Time              TIME         YES
Min Elevation             VARCHAR      YES
Max Elevation             VARCHAR      YES

Looking at the result, the type auto-detection does quite a good job. But it is easy to spot that e.g. the elevation columns fall back to VARCHAR instead of being numeric. This is a problem that will get solved later.

One piece of information that is missing for proper table entries is an identifier. To be able to work with references for joins etc. later on, we take this CSV file as the source for a new table and add an id for each entry based on the uuid() function.

CREATE TABLE activity_import AS SELECT uuid() AS id, * FROM 'Activities.csv';
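To double-check that every row received its own identifier, the total row count and the number of distinct ids should match (a small sanity check, nothing more):

SELECT count(*) AS total_rows, count(DISTINCT id) AS distinct_ids FROM activity_import;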

Activity table

Since not every activity has all values present, the data will get imported into multiple tables. The first one is some kind of base data table where Activity Type, Date and Title get stored. To derive the new table from the import table, use the statement

CREATE TABLE activity AS SELECT id, "Activity Type", "Date", "Title" FROM activity_import;

Verifying that the expected data is in the table with

SELECT * FROM activity LIMIT 10;

produces

id                                    Activity Type  Date                 Title
14cd31a9-868b-4d3d-a507-373850d90ac7  Walking        2023-07-04 08:01:08  Brunswick Walking
2e5480b4-4b2f-4dea-a340-515b04fa2083  Walking        2023-07-02 08:18:50  Brunswick Walking
19663dd5-d105-4471-91e6-69e85400ad5f  Walking        2023-06-30 15:54:38  Brunswick Walking
596c3c8d-ae86-4953-bd7f-258d23ea2726  Running        2023-06-28 07:52:00  Brunswick Running
e71833bb-f4c3-4e4e-ac02-6d183d202e0d  Running        2023-06-26 07:52:45  Brunswick Running
419acad8-2ca4-48a1-8f1a-9fc7c4263918  Running        2023-06-22 07:53:57  Brunswick Running
a6f9a970-80f4-4bc0-ad95-3e38a7b5c3e4  Pool Swimming  2023-06-17 11:39:30  Pool Swimming
7c687f12-492f-49d2-9647-68c063816e4f  Running        2023-05-28 12:20:50  Heiligenhafen Running
b4ae30e8-d498-4448-866a-184b3f8245b4  Running        2023-05-27 07:49:10  Heiligenhafen Running
faedde1e-295d-4566-9d09-1358a9adfa1a  Walking        2023-04-14 15:53:40  Brunswick Walking

Elevation comparison and type conversion

Now let's tackle the problem with the VARCHAR type of the elevation columns that we saw in the beginning. To figure out which values are non-numeric in the Min Elevation column, run:

SELECT "Min Elevation" FROM activity_import WHERE try_cast("Min Elevation" AS NUMERIC) IS NULL ORDER BY "Min Elevation";

That will uncover the "faulty" entries.

Min Elevation
--
--
--
--
--
--
--
1,246
1,246
1,269
1,312
1,318
1,321
1,353
1,412
1,448
1,449
1,449
1,458
1,475
1,476
1,476
1,476
1,476
1,476
1,477
1,478
1,478
1,478
1,478
1,478
1,483
1,494
1,524

There seem to be some empty fields indicated with -- and values whose thousands separator cannot be parsed. At least we now have a pattern to filter and alter the values before importing them into a table.
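To be sure that these two patterns cover all non-numeric values, the failures can be counted per pattern. A quick sketch using DuckDB's FILTER clause (against the import table from above):

SELECT count(*) FILTER (WHERE "Min Elevation" = '--') AS empty_fields,
       count(*) FILTER (WHERE "Min Elevation" LIKE '%,%') AS thousands_separator,
       count(*) FILTER (WHERE try_cast("Min Elevation" AS NUMERIC) IS NULL) AS non_numeric_total
FROM activity_import;

If the first two counts add up to the third, no other pattern is hiding in the data.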

To convert the column into a numeric type, the pre-processing is done this way: CASE WHEN "Min Elevation" == '--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END. Now a table activity_elevation can be created by combining the statement above with the CREATE statement. The conversion for the Max Elevation gets added to the new table as well because it is the very same as for the Min Elevation.

CREATE TABLE activity_elevation AS SELECT id,
  CASE WHEN "Min Elevation" == '--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END AS "Min Elevation",
  CASE WHEN "Max Elevation" == '--' THEN 0 ELSE replace("Max Elevation", ',', '')::numeric END AS "Max Elevation"
  FROM activity_import;
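A quick DESCRIBE on the new table should confirm that both elevation columns now come out as DECIMAL instead of VARCHAR:

DESCRIBE activity_elevation;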

Querying

The first question I asked myself after having a glance at the data was: "Which activity had the biggest span between min and max elevation?" Of course, in the real world this does not cover activities with a lot of smaller ascents that repeat over the track. The query

SELECT "Activity Type", "Date", "Max Elevation", "Min Elevation", "Max Elevation" - "Min Elevation" AS difference 
	FROM activity_elevation 
	JOIN activity USING (id) 
ORDER BY difference DESC
LIMIT 10;

returns

Activity Type  Date                 Max Elevation  Min Elevation  difference
Hiking         2022-08-01 11:02:30  2356.000       1353.000       1003.000
Cycling        2020-08-09 09:47:40  1924.000       1269.000       655.000
Cycling        2022-08-05 13:23:41  1953.000       1412.000       541.000
Cycling        2022-08-05 10:04:12  1684.000       1246.000       438.000
Walking        2021-08-05 11:13:26  1361.000       938.000        423.000
Hiking         2020-08-15 14:58:34  1940.000       1524.000       416.000
Cycling        2020-09-04 09:24:56  538.000        137.000        401.000
Hiking         2022-08-09 09:23:26  1783.000       1477.000       306.000
Hiking         2022-08-07 09:35:59  1731.000       1448.000       283.000
Cycling        2021-08-11 09:40:48  1524.000       1246.000       278.000

Those are my Top 10 activities regarding elevation. And what surprises me the most: the join with the USING (id) directive just works as expected, matching the id columns of both tables.
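For the record, USING (id) is just shorthand for an explicit join condition; written out, the same query would look like this (same result, only more verbose):

SELECT "Activity Type", "Date", "Max Elevation", "Min Elevation", "Max Elevation" - "Min Elevation" AS difference
	FROM activity_elevation e
	JOIN activity a ON e.id = a.id
ORDER BY difference DESC
LIMIT 10;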

Table-less

You might want to do everything in place and just use DuckDB as a fancy CSV tool. All of the steps above can also be combined into a single statement:

SELECT "Activity Type", "Date",
	case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END as "MinElevation",
	case when "Max Elevation"=='--' THEN 0 ELSE replace("Max Elevation", ',', '')::numeric END as "MaxElevation",
	"MaxElevation" - "MinElevation" AS difference
	from 'Activities.csv'
ORDER BY difference DESC 
LIMIT 10;
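Note that referencing the freshly defined aliases MinElevation and MaxElevation later in the same SELECT list is a DuckDB convenience; in a more portable variant, the conversion moves into a subquery (a sketch that should produce the same result):

SELECT "Activity Type", "Date", "MaxElevation" - "MinElevation" AS difference
	FROM (
		SELECT "Activity Type", "Date",
			CASE WHEN "Min Elevation" = '--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END AS "MinElevation",
			CASE WHEN "Max Elevation" = '--' THEN 0 ELSE replace("Max Elevation", ',', '')::numeric END AS "MaxElevation"
		FROM 'Activities.csv'
	)
ORDER BY difference DESC
LIMIT 10;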

And here is the statement that I have used to create the example CSV file from my original export.

COPY (SELECT "Activity Type", "Min Elevation", "Max Elevation",
	concat("Activity Type", ' on ', times.Date::varchar) AS Title,
	times.Date AS Date
	FROM 'Activities_original.csv' TABLESAMPLE 20 POSITIONAL JOIN
		(SELECT generate_series AS Date FROM generate_series(TIMESTAMP '2023-04-10 08:00:00', TIMESTAMP '2023-06-11', INTERVAL 36 HOUR) LIMIT 20) AS times
	)
TO 'Activities.csv' WITH (HEADER 1);
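The POSITIONAL JOIN used here is a DuckDB speciality: it matches rows by their position instead of by a join key. A tiny illustration with made-up values:

-- pairs 1 with 'x' and 2 with 'y', simply because they are the first and second rows
SELECT * FROM (VALUES (1), (2)) AS t1(a) POSITIONAL JOIN (VALUES ('x'), ('y')) AS t2(b);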

Activities.csv

Activity Type,Min Elevation,Max Elevation,Title,Date
Running,75,102,Running on 2023-04-10 08:00:00,2023-04-10 08:00:00
Running,"1,478","1,547",Running on 2023-04-11 20:00:00,2023-04-11 20:00:00
Cycling,65,94,Cycling on 2023-04-13 08:00:00,2023-04-13 08:00:00
Cycling,"1,246","1,684",Cycling on 2023-04-14 20:00:00,2023-04-14 20:00:00
Running,74,107,Running on 2023-04-16 08:00:00,2023-04-16 08:00:00
Hiking,"1,483","1,721",Hiking on 2023-04-17 20:00:00,2023-04-17 20:00:00
Running,-8,8,Running on 2023-04-19 08:00:00,2023-04-19 08:00:00
Walking,87,96,Walking on 2023-04-20 20:00:00,2023-04-20 20:00:00
Running,74,107,Running on 2023-04-22 08:00:00,2023-04-22 08:00:00
Running,74,107,Running on 2023-04-23 20:00:00,2023-04-23 20:00:00
Running,74,107,Running on 2023-04-25 08:00:00,2023-04-25 08:00:00
Running,75,102,Running on 2023-04-26 20:00:00,2023-04-26 20:00:00
Running,70,107,Running on 2023-04-28 08:00:00,2023-04-28 08:00:00
Gravel/Unpaved Cycling,76,93,Gravel/Unpaved Cycling on 2023-04-29 20:00:00,2023-04-29 20:00:00
Pool Swimming,--,--,Pool Swimming on 2023-05-01 08:00:00,2023-05-01 08:00:00
Cycling,137,538,Cycling on 2023-05-02 20:00:00,2023-05-02 20:00:00
Walking,964,"1,085",Walking on 2023-05-04 08:00:00,2023-05-04 08:00:00
Walking,75,102,Walking on 2023-05-05 20:00:00,2023-05-05 20:00:00
Cycling,77,112,Cycling on 2023-05-07 08:00:00,2023-05-07 08:00:00
Running,"1,449","1,545",Running on 2023-05-08 20:00:00,2023-05-08 20:00:00