Gerrit's Blog

Thoughts and ideas, mostly tech based. But I don't promise anything.

06.07.2023

First experiences with DuckDB (and SQL in general after a while)

Disclaimer

I never did hardcore SQL in my career. If you are a (PL/)SQL developer please be kind while judging me for the following content ;)

Setup

Install the DuckDB command line https://duckdb.org/docs/installation/index For the example data, I have taken a CSV export from the Garmin activities https://connect.garmin.com/modern/activities I took this one and not GPX/TPX files because I did not want to jump directly into spatial or deeper analytics during my first steps. In case you don’t have data stored on the Garmin page, you can take the content from the end of this post and store it in a file called Activities.csv. If you wonder why the CSV just shows a few entries: You need to scroll to the end of the page or at least so far that you get all the data you want to be in the CSV, what an ugly user experience.

Hello DuckDB

In the folder with the CSV file, call duckdb to start be ready to go. First of all let’s have a look at the structure of the CSV file.

DESCRIBE SELECT * FROM 'Activities.csv';

will give us an overview of the auto-detected types within the CSV file:

column_name column_type null key default extra
Activity Type VARCHAR YES      
Date TIMESTAMP YES      
Favorite BOOLEAN YES      
Title VARCHAR YES      
Distance VARCHAR YES      
Calories VARCHAR YES      
Time TIME YES      
Avg HR BIGINT YES      
Max HR BIGINT YES      
Aerobic TE DOUBLE YES      
Avg Cadence VARCHAR YES      
Max Cadence VARCHAR YES      
Avg Pace VARCHAR YES      
Best Pace VARCHAR YES      
Total Ascent VARCHAR YES      
Total Descent VARCHAR YES      
Avg Stride Length DOUBLE YES      
Avg Vertical Ratio DOUBLE YES      
Avg Vertical Oscillation DOUBLE YES      
Avg Ground Contact Time BIGINT YES      
Training Stress Score® DOUBLE YES      
Avg Power BIGINT YES      
Max Power BIGINT YES      
Grit DOUBLE YES      
Flow DOUBLE YES      
Total Strokes VARCHAR YES      
Avg. Swolf BIGINT YES      
Avg Stroke Rate BIGINT YES      
Total Reps BIGINT YES      
Dive Time VARCHAR YES      
Min Temp DOUBLE YES      
Surface Interval VARCHAR YES      
Decompression VARCHAR YES      
Best Lap Time VARCHAR YES      
Number of Laps BIGINT YES      
Max Temp DOUBLE YES      
Avg Resp VARCHAR YES      
Min Resp VARCHAR YES      
Max Resp VARCHAR YES      
Moving Time TIME YES      
Elapsed Time TIME YES      
Min Elevation VARCHAR YES      
Max Elevation VARCHAR YES      

Looking at the result, the auto-detection of types does a quite good job. But it is easy to spot that e.g. the elevation columns fall back into a varchar type instead being numeric. This is a problem that will get solved later.

Some information that is missing for a proper table entries is an identifier. To work later with references for joins etc. we should take this CSV file as the source for a new table with an id based on the uuid() function for each entry to get this additional field.

CREATE TABLE activity_import AS SELECT uuid() AS id, * FROM 'Activities.csv';

Activity table

Since not every activity has all values present, the data will get imported in multiple tables. The first one would be some kind of base data table, where Activity Type, Date and Title get stored. To derive the new table from the import table, the statement

CREATE TABLE activity AS SELECT id, "Activity Type", "Date", "Title" FROM activity_import;

Verifying that the expected data is in the table with

SELECT * FROM ACTIVITY LIMIT 10;

produces

id Activity Type Date Title
14cd31a9-868b-4d3d-a507-373850d90ac7 Walking 2023-07-04 08:01:08 Brunswick Walking
2e5480b4-4b2f-4dea-a340-515b04fa2083 Walking 2023-07-02 08:18:50 Brunswick Walking
19663dd5-d105-4471-91e6-69e85400ad5f Walking 2023-06-30 15:54:38 Brunswick Walking
596c3c8d-ae86-4953-bd7f-258d23ea2726 Running 2023-06-28 07:52:00 Brunswick Running
e71833bb-f4c3-4e4e-ac02-6d183d202e0d Running 2023-06-26 07:52:45 Brunswick Running
419acad8-2ca4-48a1-8f1a-9fc7c4263918 Running 2023-06-22 07:53:57 Brunswick Running
a6f9a970-80f4-4bc0-ad95-3e38a7b5c3e4 Pool Swimming 2023-06-17 11:39:30 Pool Swimming
7c687f12-492f-49d2-9647-68c063816e4f Running 2023-05-28 12:20:50 Heiligenhafen Running
b4ae30e8-d498-4448-866a-184b3f8245b4 Running 2023-05-27 07:49:10 Heiligenhafen Running
faedde1e-295d-4566-9d09-1358a9adfa1a Walking 2023-04-14 15:53:40 Brunswick Walking

Elevation Comparison and Type conversion

Now let’s take on the problem with the varchar for the elevation columns that we saw in the beginning. To figure out which values are non-numeric in the Min Elevation column, run:

SELECT "Min Elevation" FROM activity_import WHERE try_cast("Min Elevation" AS NUMERIC) IS NULL ORDER BY "Min Elevation";

That will uncover the “faulty” entries.

Min Elevation
1,246
1,246
1,269
1,312
1,318
1,321
1,353
1,412
1,448
1,449
1,449
1,458
1,475
1,476
1,476
1,476
1,476
1,476
1,477
1,478
1,478
1,478
1,478
1,478
1,483
1,494
1,524

There seem to be some empty fields indicated with -- and the thousands marker cannot be parsed. At least a pattern now to filter and alter the values before we import them into a table.

To convert the column into a numeric type, the pre-processing is done this way case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END. Now a table activities_elevation can be created combining the statement above with the CREATE statement. Also, the conversion for the Max Elevation will get added to the new table because it is the very same as for the Min Elevation.

CREATE TABLE activity_elevation AS SELECT id,
  case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END as "Min Elevation",
  case when "Max Elevation"=='--' THEN 0 ELSE replace("Max Elevation", ',', '')::numeric END as "Max Elevation"
  FROM activity_import;

Querying

The question that I asked myself first after having a glance at the data was: “Which activity had the most span between min and max elevation”. Of course in the real world this does not cover those activities with a lot of smaller ascents that repeat over the track. The query

SELECT "Activity Type", "Date", "Max Elevation", "Min Elevation", "Max Elevation" - "Min Elevation" AS difference 
	FROM activity_elevation 
	JOIN activity USING (id) 
ORDER BY difference DESC
LIMIT 10;

returns

Activity Type Date Max Elevation Min Elevation difference
Hiking 2022-08-01 11:02:30 2356.000 1353.000 1003.000
Cycling 2020-08-09 09:47:40 1924.000 1269.000 655.000
Cycling 2022-08-05 13:23:41 1953.000 1412.000 541.000
Cycling 2022-08-05 10:04:12 1684.000 1246.000 438.000
Walking 2021-08-05 11:13:26 1361.000 938.000 423.000
Hiking 2020-08-15 14:58:34 1940.000 1524.000 416.000
Cycling 2020-09-04 09:24:56 538.000 137.000 401.000
Hiking 2022-08-09 09:23:26 1783.000 1477.000 306.000
Hiking 2022-08-07 09:35:59 1731.000 1448.000 283.000
Cycling 2021-08-11 09:40:48 1524.000 1246.000 278.000

Those are my Top10 activities regarding elevation. And what surprises me the most: The join with the directive USING (id) just works as expected by matching the id columns of both tables.

Table-less

You might want to do everything in-place and just use DuckDB as a fancy CSV tool. The whole steps above can also be combined into a single statement:

SELECT "Activity Type", "Date",
	case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END as "MinElevation",
	case when "Max Elevation"=='--' THEN 0 ELSE replace("Max Elevation", ',', '')::numeric END as "MaxElevation",
	"MaxElevation" - "MinElevation" AS difference
	from 'Activities.csv'
ORDER BY difference DESC 
LIMIT 10;

And here is the statement that I have used to create the example CSV file from my original export.

COPY (SELECT "Activity Type", "Min Elevation", "Max Elevation",
	concat("Activity Type", ' on ', times.Date::varchar) as Title,
	times.Date as Date
	from 'Activities_original.csv' TABLESAMPLE 20 POSITIONAL JOIN 
		(SELECT generate_series as Date from generate_series(TIMESTAMP '2023-04-10 08:00:00', TIMESTAMP '2023-06-11', INTERVAL 36 HOUR) LIMIT 20) as times
	)
TO 'Activities.csv' WITH (HEADER 1);

Activities.csv

Activity Type,Min Elevation,Max Elevation,Title,Date
Running,75,102,Running on 2023-04-10 08:00:00,2023-04-10 08:00:00
Running,"1,478","1,547",Running on 2023-04-11 20:00:00,2023-04-11 20:00:00
Cycling,65,94,Cycling on 2023-04-13 08:00:00,2023-04-13 08:00:00
Cycling,"1,246","1,684",Cycling on 2023-04-14 20:00:00,2023-04-14 20:00:00
Running,74,107,Running on 2023-04-16 08:00:00,2023-04-16 08:00:00
Hiking,"1,483","1,721",Hiking on 2023-04-17 20:00:00,2023-04-17 20:00:00
Running,-8,8,Running on 2023-04-19 08:00:00,2023-04-19 08:00:00
Walking,87,96,Walking on 2023-04-20 20:00:00,2023-04-20 20:00:00
Running,74,107,Running on 2023-04-22 08:00:00,2023-04-22 08:00:00
Running,74,107,Running on 2023-04-23 20:00:00,2023-04-23 20:00:00
Running,74,107,Running on 2023-04-25 08:00:00,2023-04-25 08:00:00
Running,75,102,Running on 2023-04-26 20:00:00,2023-04-26 20:00:00
Running,70,107,Running on 2023-04-28 08:00:00,2023-04-28 08:00:00
Gravel/Unpaved Cycling,76,93,Gravel/Unpaved Cycling on 2023-04-29 20:00:00,2023-04-29 20:00:00
Pool Swimming,--,--,Pool Swimming on 2023-05-01 08:00:00,2023-05-01 08:00:00
Cycling,137,538,Cycling on 2023-05-02 20:00:00,2023-05-02 20:00:00
Walking,964,"1,085",Walking on 2023-05-04 08:00:00,2023-05-04 08:00:00
Walking,75,102,Walking on 2023-05-05 20:00:00,2023-05-05 20:00:00
Cycling,77,112,Cycling on 2023-05-07 08:00:00,2023-05-07 08:00:00
Running,"1,449","1,545",Running on 2023-05-08 20:00:00,2023-05-08 20:00:00