06.07.2023

First experiences with DuckDB (and SQL in general after a while)

My first steps with DuckDB

Disclaimer

I never did hardcore SQL in my career. If you are a (PL/)SQL developer please be kind while judging me for the following content ;)

Setup

Install the DuckDB command line https://duckdb.org/docs/installation/index For the example data, I have taken a CSV export from the Garmin activities https://connect.garmin.com/modern/activities I took this one and not GPX/TPX files because I did not want to jump directly into spatial or deeper analytics during my first steps. In case you don't have data stored on the Garmin page, you can take the content from the end of this post and store it in a file called Activities.csv. If you wonder why the CSV just shows a few entries: You need to scroll to the end of the page or at least so far that you get all the data you want to be in the CSV, what an ugly user experience.

Hello DuckDB

In the folder with the CSV file, call duckdb to start be ready to go. First of all let's have a look at the structure of the CSV file.

DESCRIBE SELECT * FROM 'Activities.csv';

will give us an overview of the auto-detected types within the CSV file:

column_name	column_type	null
Activity Type	VARCHAR	YES
Date	TIMESTAMP	YES
Favorite	BOOLEAN	YES
Title	VARCHAR	YES
Distance	VARCHAR	YES
Calories	VARCHAR	YES
Time	TIME	YES
Avg HR	BIGINT	YES
Max HR	BIGINT	YES
Aerobic TE	DOUBLE	YES
Avg Cadence	VARCHAR	YES
Max Cadence	VARCHAR	YES
Avg Pace	VARCHAR	YES
Best Pace	VARCHAR	YES
Total Ascent	VARCHAR	YES
Total Descent	VARCHAR	YES
Avg Stride Length	DOUBLE	YES
Avg Vertical Ratio	DOUBLE	YES
Avg Vertical Oscillation	DOUBLE	YES
Avg Ground Contact Time	BIGINT	YES
Training Stress Score®	DOUBLE	YES
Avg Power	BIGINT	YES
Max Power	BIGINT	YES
Grit	DOUBLE	YES
Flow	DOUBLE	YES
Total Strokes	VARCHAR	YES
Avg. Swolf	BIGINT	YES
Avg Stroke Rate	BIGINT	YES
Total Reps	BIGINT	YES
Dive Time	VARCHAR	YES
Min Temp	DOUBLE	YES
Surface Interval	VARCHAR	YES
Decompression	VARCHAR	YES
Best Lap Time	VARCHAR	YES
Number of Laps	BIGINT	YES
Max Temp	DOUBLE	YES
Avg Resp	VARCHAR	YES
Min Resp	VARCHAR	YES
Max Resp	VARCHAR	YES
Moving Time	TIME	YES
Elapsed Time	TIME	YES
Min Elevation	VARCHAR	YES
Max Elevation	VARCHAR	YES

Looking at the result, the auto-detection of types does a quite good job. But it is easy to spot that e.g. the elevation columns fall back into a varchar type instead being numeric. This is a problem that will get solved later.

Some information that is missing for a proper table entries is an identifier. To work later with references for joins etc. we should take this CSV file as the source for a new table with an id based on the uuid() function for each entry to get this additional field.

CREATE TABLE activity_import AS SELECT uuid() AS id, * FROM 'Activities.csv';

Activity table

Since not every activity has all values present, the data will get imported in multiple tables. The first one would be some kind of base data table, where Activity Type, Date and Title get stored. To derive the new table from the import table, the statement

CREATE TABLE activity AS SELECT id, "Activity Type", "Date", "Title" FROM activity_import;

Verifying that the expected data is in the table with

SELECT * FROM ACTIVITY LIMIT 10;

produces

id	Activity Type	Date	Title
14cd31a9-868b-4d3d-a507-373850d90ac7	Walking	2023-07-04 08:01:08	Brunswick Walking
2e5480b4-4b2f-4dea-a340-515b04fa2083	Walking	2023-07-02 08:18:50	Brunswick Walking
19663dd5-d105-4471-91e6-69e85400ad5f	Walking	2023-06-30 15:54:38	Brunswick Walking
596c3c8d-ae86-4953-bd7f-258d23ea2726	Running	2023-06-28 07:52:00	Brunswick Running
e71833bb-f4c3-4e4e-ac02-6d183d202e0d	Running	2023-06-26 07:52:45	Brunswick Running
419acad8-2ca4-48a1-8f1a-9fc7c4263918	Running	2023-06-22 07:53:57	Brunswick Running
a6f9a970-80f4-4bc0-ad95-3e38a7b5c3e4	Pool Swimming	2023-06-17 11:39:30	Pool Swimming
7c687f12-492f-49d2-9647-68c063816e4f	Running	2023-05-28 12:20:50	Heiligenhafen Running
b4ae30e8-d498-4448-866a-184b3f8245b4	Running	2023-05-27 07:49:10	Heiligenhafen Running
faedde1e-295d-4566-9d09-1358a9adfa1a	Walking	2023-04-14 15:53:40	Brunswick Walking

Elevation Comparison and Type conversion

Now let's take on the problem with the varchar for the elevation columns that we saw in the beginning. To figure out which values are non-numeric in the Min Elevation column, run:

SELECT "Min Elevation" FROM activity_import WHERE try_cast("Min Elevation" AS NUMERIC) IS NULL ORDER BY "Min Elevation";

That will uncover the "faulty" entries.

Min Elevation
--
--
--
--
--
--
--
1,246
1,246
1,269
1,312
1,318
1,321
1,353
1,412
1,448
1,449
1,449
1,458
1,475
1,476
1,476
1,476
1,476
1,476
1,477
1,478
1,478
1,478
1,478
1,478
1,483
1,494
1,524

There seem to be some empty fields indicated with -- and the thousands marker cannot be parsed. At least a pattern now to filter and alter the values before we import them into a table.

To convert the column into a numeric type, the pre-processing is done this way case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END. Now a table activities_elevation can be created combining the statement above with the CREATE statement. Also, the conversion for the Max Elevation will get added to the new table because it is the very same as for the Min Elevation.

CREATE TABLE activity_elevation AS SELECT id,
  case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END as "Min Elevation",
  case when "Max Elevation"=='--' THEN 0 ELSE replace("Max Elevation", ',', '')::numeric END as "Max Elevation"
  FROM activity_import;

Querying

The question that I asked myself first after having a glance at the data was: "Which activity had the most span between min and max elevation". Of course in the real world this does not cover those activities with a lot of smaller ascents that repeat over the track. The query

SELECT "Activity Type", "Date", "Max Elevation", "Min Elevation", "Max Elevation" - "Min Elevation" AS difference 
	FROM activity_elevation 
	JOIN activity USING (id) 
ORDER BY difference DESC
LIMIT 10;

returns

Activity Type	Date	Max Elevation	Min Elevation	difference
Hiking	2022-08-01 11:02:30	2356.000	1353.000	1003.000
Cycling	2020-08-09 09:47:40	1924.000	1269.000	655.000
Cycling	2022-08-05 13:23:41	1953.000	1412.000	541.000
Cycling	2022-08-05 10:04:12	1684.000	1246.000	438.000
Walking	2021-08-05 11:13:26	1361.000	938.000	423.000
Hiking	2020-08-15 14:58:34	1940.000	1524.000	416.000
Cycling	2020-09-04 09:24:56	538.000	137.000	401.000
Hiking	2022-08-09 09:23:26	1783.000	1477.000	306.000
Hiking	2022-08-07 09:35:59	1731.000	1448.000	283.000
Cycling	2021-08-11 09:40:48	1524.000	1246.000	278.000

Those are my Top10 activities regarding elevation. And what surprises me the most: The join with the directive USING (id) just works as expected by matching the id columns of both tables.

Table-less

You might want to do everything in-place and just use DuckDB as a fancy CSV tool. The whole steps above can also be combined into a single statement:

SELECT "Activity Type", "Date",
	case when "Min Elevation"=='--' THEN 0 ELSE replace("Min Elevation", ',', '')::numeric END as "MinElevation",
	case when "Max Elevation"=='--' THEN 0 ELSE replace("Max Elevation", ',', '')::numeric END as "MaxElevation",
	"MaxElevation" - "MinElevation" AS difference
	from 'Activities.csv'
ORDER BY difference DESC 
LIMIT 10;

And here is the statement that I have used to create the example CSV file from my original export.

COPY (SELECT "Activity Type", "Min Elevation", "Max Elevation",
	concat("Activity Type", ' on ', times.Date::varchar) as Title,
	times.Date as Date
	from 'Activities_original.csv' TABLESAMPLE 20 POSITIONAL JOIN 
		(SELECT generate_series as Date from generate_series(TIMESTAMP '2023-04-10 08:00:00', TIMESTAMP '2023-06-11', INTERVAL 36 HOUR) LIMIT 20) as times
	)
TO 'Activities.csv' WITH (HEADER 1);

Activities.csv

Activity Type,Min Elevation,Max Elevation,Title,Date
Running,75,102,Running on 2023-04-10 08:00:00,2023-04-10 08:00:00
Running,"1,478","1,547",Running on 2023-04-11 20:00:00,2023-04-11 20:00:00
Cycling,65,94,Cycling on 2023-04-13 08:00:00,2023-04-13 08:00:00
Cycling,"1,246","1,684",Cycling on 2023-04-14 20:00:00,2023-04-14 20:00:00
Running,74,107,Running on 2023-04-16 08:00:00,2023-04-16 08:00:00
Hiking,"1,483","1,721",Hiking on 2023-04-17 20:00:00,2023-04-17 20:00:00
Running,-8,8,Running on 2023-04-19 08:00:00,2023-04-19 08:00:00
Walking,87,96,Walking on 2023-04-20 20:00:00,2023-04-20 20:00:00
Running,74,107,Running on 2023-04-22 08:00:00,2023-04-22 08:00:00
Running,74,107,Running on 2023-04-23 20:00:00,2023-04-23 20:00:00
Running,74,107,Running on 2023-04-25 08:00:00,2023-04-25 08:00:00
Running,75,102,Running on 2023-04-26 20:00:00,2023-04-26 20:00:00
Running,70,107,Running on 2023-04-28 08:00:00,2023-04-28 08:00:00
Gravel/Unpaved Cycling,76,93,Gravel/Unpaved Cycling on 2023-04-29 20:00:00,2023-04-29 20:00:00
Pool Swimming,--,--,Pool Swimming on 2023-05-01 08:00:00,2023-05-01 08:00:00
Cycling,137,538,Cycling on 2023-05-02 20:00:00,2023-05-02 20:00:00
Walking,964,"1,085",Walking on 2023-05-04 08:00:00,2023-05-04 08:00:00
Walking,75,102,Walking on 2023-05-05 20:00:00,2023-05-05 20:00:00
Cycling,77,112,Cycling on 2023-05-07 08:00:00,2023-05-07 08:00:00
Running,"1,449","1,545",Running on 2023-05-08 20:00:00,2023-05-08 20:00:00