FileTrees.jl — overview

FileTrees is a set of tools to lazy-load, process and save file trees. Built-in parallelism allows you to max out all threads and processes that Julia is running with.

Files and subtrees in a file tree can have any value attached to them. You can map and reduce over these values, or combine them by merging or collapsing trees or subtrees. When computing lazy trees, these values are held in distributed memory and operated on in parallel.

Tree operations such as map, filter, mv, merge, and diff do not modify their input; each returns a new tree. Nothing is written to disk until save is called, so restructuring a tree is cheap and fast.
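A minimal sketch of what this means in practice, assuming a throwaway directory built with mktempdir (the data/a/x.csv layout is made up for illustration): mv returns a restructured tree, while the files on disk stay as they were.

```julia
using FileTrees

# Hypothetical scratch directory, created only for this sketch.
root = mktempdir()
mkpath(joinpath(root, "data", "a"))
write(joinpath(root, "data", "a", "x.csv"), "col\n1\n")

cd(root) do
    tree = FileTree("data")

    # mv returns a new tree in which a/x.csv appears as b/y.csv.
    moved = mv(tree, r"^a/x\.csv$", s"b/y.csv")
    @show moved
end

# Nothing was written: check where the file is and whether b/ exists on disk.
@show isfile(joinpath(root, "data", "a", "x.csv"))
@show isdir(joinpath(root, "data", "b"))
```

Only a later call to save on the moved tree would create anything under b/.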

Getting started

You can install FileTrees with:

using Pkg
Pkg.add("FileTrees")

In this article

Below, we will see how to load a directory of files, do something to them, and then combine the results. This should help you get started!

Follow along

You can navigate to the page/ folder under the FileTrees package directory to try this out for yourself with the sample data there.

julia -e 'using Pkg; Pkg.develop("FileTrees")'
cd ~/.julia/dev/FileTrees/page/

Or you can try it with your own directory of data files!

Loading directories

The basic data structure in FileTrees is the FileTree.

Calling FileTree with a directory name will walk the directory on disk and construct a FileTree. Here we have a tiny sampling of data from the NYC Taxi dataset, for January and February of 2019 and 2020. Let's read the FileTree from this directory:

using FileTrees

taxi_dir = FileTree("taxi-data")
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ green.csv
│  │  └─ yellow.csv
│  └─ 02/
│     ├─ green.csv
│     └─ yellow.csv
└─ 2020/
   ├─ 01/
   │  ├─ green.csv
   │  └─ yellow.csv
   └─ 02/
      ├─ green.csv
      └─ yellow.csv

The files in the directory can be loaded using the load function. Here we will use CSV and DataFrames to load the CSV files.

using DataFrames, CSV

dfs = FileTrees.load(taxi_dir) do file
    DataFrame(CSV.File(path(file)))
end
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ green.csv (9×20 DataFrame)
│  │  └─ yellow.csv (9×18 DataFrame)
│  └─ 02/
│     ├─ green.csv (9×20 DataFrame)
│     └─ yellow.csv (9×18 DataFrame)
└─ 2020/
   ├─ 01/
   │  ├─ green.csv (9×20 DataFrame)
   │  └─ yellow.csv (9×18 DataFrame)
   └─ 02/
      ├─ green.csv (9×20 DataFrame)
      └─ yellow.csv (9×18 DataFrame)

A summary of the value loaded into each file is shown in parentheses. The file argument passed to the load callback is a File object. It supports the name and path functions, among others. path returns an AbstractPath that refers to the file's location.

load returns a new FileTree which has the same structure as before, but contains the loaded data in each File node.
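As a small illustration of the callback argument, here is a sketch that attaches a file's name and size to each node instead of its contents. The docs/ directory, the readme.txt file and the fname/bytes field names are made up for this example; only load, name, path and get come from the text above.

```julia
using FileTrees

# Hypothetical scratch directory, created only for this sketch.
root = mktempdir()
mkpath(joinpath(root, "docs", "2020"))
write(joinpath(root, "docs", "2020", "readme.txt"), "first line\nsecond line\n")

info = cd(root) do
    # The callback receives a File node; whatever it returns is attached to that node.
    loaded = FileTrees.load(FileTree("docs")) do file
        p = string(path(file))                  # the file's location as a String
        (fname = name(file), bytes = filesize(p))
    end
    get(loaded["2020/readme.txt"])              # fetch the value stored on one node
end

@show info
```

The value can be anything the callback returns, which is why the same load call works for DataFrames, strings or small summaries like this one.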

Here load actually read the files eagerly. This may not be desirable if the data to be loaded are too big to fit in memory, or you don't intend to use all of it, but only a subtree of it.

In such a case, you can load the files lazily using lazy=true:

lazy_dfs = FileTrees.load(taxi_dir; lazy=true) do file
    DataFrame(CSV.File(path(file)))
end
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ green.csv (Thunk(#3, (File(taxi-data/2019/01/green.csv),)))
│  │  └─ yellow.csv (Thunk(#3, (File(taxi-data/2019/01/yellow.csv),)))
│  └─ 02/
│     ├─ green.csv (Thunk(#3, (File(taxi-data/2019/02/green.csv),)))
│     └─ yellow.csv (Thunk(#3, (File(taxi-data/2019/02/yellow.csv),)))
└─ 2020/
   ├─ 01/
   │  ├─ green.csv (Thunk(#3, (File(taxi-data/2020/01/green.csv),)))
   │  └─ yellow.csv (Thunk(#3, (File(taxi-data/2020/01/yellow.csv),)))
   └─ 02/
      ├─ green.csv (Thunk(#3, (File(taxi-data/2020/02/green.csv),)))
      └─ yellow.csv (Thunk(#3, (File(taxi-data/2020/02/yellow.csv),)))

As you can see, the nodes hold Thunk objects. A Thunk represents a lazy task that can later be executed using the exec function. You can continue to use most of the functions in this package without worrying about whether the input tree has lazy values or not: wherever the input tree has lazy values, you get the corresponding lazy outputs. Lazy values also encode the dependencies between them, which makes it possible for exec to compute them in parallel.

See this article to learn more about how to work with values. For more details on laziness and parallelism, see this article.
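The following sketch shows the lazy workflow end to end on plain text files, so it does not depend on the taxi data. The logs/ directory and its notes.txt files are assumptions made for this example; the calls themselves (load with lazy=true, get, exec) are the ones described above.

```julia
using FileTrees

# Hypothetical scratch directory, created only for this sketch.
root = mktempdir()
for year in ("2019", "2020")
    mkpath(joinpath(root, "logs", year))
    write(joinpath(root, "logs", year, "notes.txt"), "entry for $year\n")
end

text2020 = cd(root) do
    # With lazy=true no file is read yet; each node holds a delayed task.
    lazy_tree = FileTrees.load(FileTree("logs"); lazy=true) do file
        read(string(path(file)), String)
    end

    # exec runs the task behind the selected node and returns its result.
    exec(get(lazy_tree["2020/notes.txt"]))
end

@show text2020
```

Because only one node is executed, the 2019 file is not needed to produce this result.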

Looking files up

Let's look at one of these DataFrames by indexing into the tree with the path to a file, namely "2020/01/yellow.csv".

yellow_jan_20 = dfs["2020/01/yellow.csv"]
File(taxi-data/2020/01/yellow.csv)

get(file) fetches the value stored in a File or FileTree node:

get(yellow_jan_20)
9×18 DataFrame
│ Row │ VendorID │ tpep_pickup_datetime │ tpep_dropoff_datetime │ passenger_count │ trip_distance │ RatecodeID │ store_and_fwd_flag │ PULocationID │ DOLocationID │ payment_type │ fare_amount │ extra   │ mta_tax │ tip_amount │ tolls_amount │ improvement_surcharge │ total_amount │ congestion_surcharge │
│     │ Int64    │ String               │ String                │ Int64           │ Float64       │ Int64      │ String             │ Int64        │ Int64        │ Int64        │ Float64     │ Float64 │ Float64 │ Float64    │ Int64        │ Float64               │ Float64      │ Float64              │
├─────┼──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┤
│ 1   │ 1        │ 2020-01-01 00:28:15  │ 2020-01-01 00:33:03   │ 1               │ 1.2           │ 1          │ N                  │ 238          │ 239          │ 1            │ 6.0         │ 3.0     │ 0.5     │ 1.47       │ 0            │ 0.3                   │ 11.27        │ 2.5                  │
│ 2   │ 1        │ 2020-01-01 00:35:39  │ 2020-01-01 00:43:04   │ 1               │ 1.2           │ 1          │ N                  │ 239          │ 238          │ 1            │ 7.0         │ 3.0     │ 0.5     │ 1.5        │ 0            │ 0.3                   │ 12.3         │ 2.5                  │
│ 3   │ 1        │ 2020-01-01 00:47:41  │ 2020-01-01 00:53:52   │ 1               │ 0.6           │ 1          │ N                  │ 238          │ 238          │ 1            │ 6.0         │ 3.0     │ 0.5     │ 1.0        │ 0            │ 0.3                   │ 10.8         │ 2.5                  │
│ 4   │ 1        │ 2020-01-01 00:55:23  │ 2020-01-01 01:00:14   │ 1               │ 0.8           │ 1          │ N                  │ 238          │ 151          │ 1            │ 5.5         │ 0.5     │ 0.5     │ 1.36       │ 0            │ 0.3                   │ 8.16         │ 0.0                  │
│ 5   │ 2        │ 2020-01-01 00:01:58  │ 2020-01-01 00:04:16   │ 1               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 2            │ 3.5         │ 0.5     │ 0.5     │ 0.0        │ 0            │ 0.3                   │ 4.8          │ 0.0                  │
│ 6   │ 2        │ 2020-01-01 00:09:44  │ 2020-01-01 00:10:37   │ 1               │ 0.03          │ 1          │ N                  │ 7            │ 193          │ 2            │ 2.5         │ 0.5     │ 0.5     │ 0.0        │ 0            │ 0.3                   │ 3.8          │ 0.0                  │
│ 7   │ 2        │ 2020-01-01 00:39:25  │ 2020-01-01 00:39:29   │ 1               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 1            │ 2.5         │ 0.5     │ 0.5     │ 0.01       │ 0            │ 0.3                   │ 3.81         │ 0.0                  │
│ 8   │ 2        │ 2019-12-18 15:27:49  │ 2019-12-18 15:28:59   │ 1               │ 0.0           │ 5          │ N                  │ 193          │ 193          │ 1            │ 0.01        │ 0.0     │ 0.0     │ 0.0        │ 0            │ 0.3                   │ 2.81         │ 2.5                  │
│ 9   │ 2        │ 2019-12-18 15:30:35  │ 2019-12-18 15:31:35   │ 4               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 1            │ 2.5         │ 0.5     │ 0.5     │ 0.0        │ 0            │ 0.3                   │ 6.3          │ 2.5                  │

When a tree is lazy, the get operation returns a Thunk, a delayed computation.

You can call exec on this value to compute and fetch the result.

val = get(lazy_dfs["2020/01/yellow.csv"])

@show typeof(val)
@show exec(val);
typeof(val) = Dagger.Thunk
exec(val) = 9×18 DataFrame
│ Row │ VendorID │ tpep_pickup_datetime │ tpep_dropoff_datetime │ passenger_count │ trip_distance │ RatecodeID │ store_and_fwd_flag │ PULocationID │ DOLocationID │ payment_type │ fare_amount │ extra   │ mta_tax │ tip_amount │ tolls_amount │ improvement_surcharge │ total_amount │ congestion_surcharge │
│     │ Int64    │ String               │ String                │ Int64           │ Float64       │ Int64      │ String             │ Int64        │ Int64        │ Int64        │ Float64     │ Float64 │ Float64 │ Float64    │ Int64        │ Float64               │ Float64      │ Float64              │
├─────┼──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┤
│ 1   │ 1        │ 2020-01-01 00:28:15  │ 2020-01-01 00:33:03   │ 1               │ 1.2           │ 1          │ N                  │ 238          │ 239          │ 1            │ 6.0         │ 3.0     │ 0.5     │ 1.47       │ 0            │ 0.3                   │ 11.27        │ 2.5                  │
│ 2   │ 1        │ 2020-01-01 00:35:39  │ 2020-01-01 00:43:04   │ 1               │ 1.2           │ 1          │ N                  │ 239          │ 238          │ 1            │ 7.0         │ 3.0     │ 0.5     │ 1.5        │ 0            │ 0.3                   │ 12.3         │ 2.5                  │
│ 3   │ 1        │ 2020-01-01 00:47:41  │ 2020-01-01 00:53:52   │ 1               │ 0.6           │ 1          │ N                  │ 238          │ 238          │ 1            │ 6.0         │ 3.0     │ 0.5     │ 1.0        │ 0            │ 0.3                   │ 10.8         │ 2.5                  │
│ 4   │ 1        │ 2020-01-01 00:55:23  │ 2020-01-01 01:00:14   │ 1               │ 0.8           │ 1          │ N                  │ 238          │ 151          │ 1            │ 5.5         │ 0.5     │ 0.5     │ 1.36       │ 0            │ 0.3                   │ 8.16         │ 0.0                  │
│ 5   │ 2        │ 2020-01-01 00:01:58  │ 2020-01-01 00:04:16   │ 1               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 2            │ 3.5         │ 0.5     │ 0.5     │ 0.0        │ 0            │ 0.3                   │ 4.8          │ 0.0                  │
│ 6   │ 2        │ 2020-01-01 00:09:44  │ 2020-01-01 00:10:37   │ 1               │ 0.03          │ 1          │ N                  │ 7            │ 193          │ 2            │ 2.5         │ 0.5     │ 0.5     │ 0.0        │ 0            │ 0.3                   │ 3.8          │ 0.0                  │
│ 7   │ 2        │ 2020-01-01 00:39:25  │ 2020-01-01 00:39:29   │ 1               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 1            │ 2.5         │ 0.5     │ 0.5     │ 0.01       │ 0            │ 0.3                   │ 3.81         │ 0.0                  │
│ 8   │ 2        │ 2019-12-18 15:27:49  │ 2019-12-18 15:28:59   │ 1               │ 0.0           │ 5          │ N                  │ 193          │ 193          │ 1            │ 0.01        │ 0.0     │ 0.0     │ 0.0        │ 0            │ 0.3                   │ 2.81         │ 2.5                  │
│ 9   │ 2        │ 2019-12-18 15:30:35  │ 2019-12-18 15:31:35   │ 4               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 1            │ 2.5         │ 0.5     │ 0.5     │ 0.0        │ 0            │ 0.3                   │ 6.3          │ 2.5                  │

Yellow and green taxi data have different sets of columns, so it may be convenient to separate them into two trees:

yellow = dfs[glob"*/*/yellow.csv"]
green = dfs[glob"*/*/green.csv"];

[yellow green]
1×2 Array{FileTrees.FileTree,2}:
 taxi-data/
├─ 2019/
│  ├─ 01/
│  │  └─ yellow.csv (9×18 DataFrame)
│  └─ 02/
│     └─ yellow.csv (9×18 DataFrame)
└─ 2020/
   ├─ 01/
   │  └─ yellow.csv (9×18 DataFrame)
   └─ 02/
      └─ yellow.csv (9×18 DataFrame)
  taxi-data/
├─ 2019/
│  ├─ 01/
│  │  └─ green.csv (9×20 DataFrame)
│  └─ 02/
│     └─ green.csv (9×20 DataFrame)
└─ 2020/
   ├─ 01/
   │  └─ green.csv (9×20 DataFrame)
   └─ 02/
      └─ green.csv (9×20 DataFrame)

Here we used a glob expression constructed with the glob"" string macro. This macro is provided by Glob.jl and is re-exported by FileTrees.

See the pattern matching documentation to learn more about how to use pattern matching to manipulate trees.

Combining loaded data

Now that we have files with the same schema in separate trees, we can reduce either tree with the vcat function to combine its DataFrames into a single DataFrame:

yellowdf = reducevalues(vcat, yellow)

first(yellowdf, 15)
15×18 DataFrame
│ Row │ VendorID │ tpep_pickup_datetime │ tpep_dropoff_datetime │ passenger_count │ trip_distance │ RatecodeID │ store_and_fwd_flag │ PULocationID │ DOLocationID │ payment_type │ fare_amount │ extra   │ mta_tax │ tip_amount │ tolls_amount │ improvement_surcharge │ total_amount │ congestion_surcharge │
│     │ Int64    │ String               │ String                │ Int64           │ Float64       │ Int64      │ String             │ Int64        │ Int64        │ Int64        │ Float64     │ Float64 │ Float64 │ Float64    │ Float64      │ Float64               │ Float64      │ Float64?             │
├─────┼──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┤
│ 1   │ 1        │ 2019-01-01 00:46:40  │ 2019-01-01 00:53:20   │ 1               │ 1.5           │ 1          │ N                  │ 151          │ 239          │ 1            │ 7.0         │ 0.5     │ 0.5     │ 1.65       │ 0.0          │ 0.3                   │ 9.95         │ missing              │
│ 2   │ 1        │ 2019-01-01 00:59:47  │ 2019-01-01 01:18:59   │ 1               │ 2.6           │ 1          │ N                  │ 239          │ 246          │ 1            │ 14.0        │ 0.5     │ 0.5     │ 1.0        │ 0.0          │ 0.3                   │ 16.3         │ missing              │
│ 3   │ 2        │ 2018-12-21 13:48:30  │ 2018-12-21 13:52:40   │ 3               │ 0.0           │ 1          │ N                  │ 236          │ 236          │ 1            │ 4.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 5.8          │ missing              │
│ 4   │ 2        │ 2018-11-28 15:52:25  │ 2018-11-28 15:55:45   │ 5               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 2            │ 3.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 7.55         │ missing              │
│ 5   │ 2        │ 2018-11-28 15:56:57  │ 2018-11-28 15:58:33   │ 5               │ 0.0           │ 2          │ N                  │ 193          │ 193          │ 2            │ 52.0        │ 0.0     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 55.55        │ missing              │
│ 6   │ 2        │ 2018-11-28 16:25:49  │ 2018-11-28 16:28:26   │ 5               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 2            │ 3.5         │ 0.5     │ 0.5     │ 0.0        │ 5.76         │ 0.3                   │ 13.31        │ missing              │
│ 7   │ 2        │ 2018-11-28 16:29:37  │ 2018-11-28 16:33:43   │ 5               │ 0.0           │ 2          │ N                  │ 193          │ 193          │ 2            │ 52.0        │ 0.0     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 55.55        │ missing              │
│ 8   │ 1        │ 2019-01-01 00:21:28  │ 2019-01-01 00:28:37   │ 1               │ 1.3           │ 1          │ N                  │ 163          │ 229          │ 1            │ 6.5         │ 0.5     │ 0.5     │ 1.25       │ 0.0          │ 0.3                   │ 9.05         │ missing              │
│ 9   │ 1        │ 2019-01-01 00:32:01  │ 2019-01-01 00:45:39   │ 1               │ 3.7           │ 1          │ N                  │ 229          │ 7            │ 1            │ 13.5        │ 0.5     │ 0.5     │ 3.7        │ 0.0          │ 0.3                   │ 18.5         │ missing              │
│ 10  │ 1        │ 2019-02-01 00:59:04  │ 2019-02-01 01:07:27   │ 1               │ 2.1           │ 1          │ N                  │ 48           │ 234          │ 1            │ 9.0         │ 0.5     │ 0.5     │ 2.0        │ 0.0          │ 0.3                   │ 12.3         │ 0.0                  │
│ 11  │ 1        │ 2019-02-01 00:33:09  │ 2019-02-01 01:03:58   │ 1               │ 9.8           │ 1          │ N                  │ 230          │ 93           │ 2            │ 32.0        │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 33.3         │ 0.0                  │
│ 12  │ 1        │ 2019-02-01 00:09:03  │ 2019-02-01 00:09:16   │ 1               │ 0.0           │ 1          │ N                  │ 145          │ 145          │ 2            │ 2.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 3.8          │ 0.0                  │
│ 13  │ 1        │ 2019-02-01 00:45:38  │ 2019-02-01 00:51:10   │ 1               │ 0.8           │ 1          │ N                  │ 95           │ 95           │ 2            │ 5.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 6.8          │ 0.0                  │
│ 14  │ 1        │ 2019-02-01 00:25:30  │ 2019-02-01 00:28:14   │ 1               │ 0.8           │ 1          │ N                  │ 140          │ 263          │ 2            │ 5.0         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 6.3          │ 0.0                  │
│ 15  │ 1        │ 2019-02-01 00:38:02  │ 2019-02-01 00:40:57   │ 1               │ 0.8           │ 1          │ N                  │ 229          │ 141          │ 2            │ 4.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 5.8          │ 0.0                  │

reducevalues also works on the lazy tree but returns a lazy final result. You can call exec on it to actually compute it. This causes the computation to occur in parallel!

yellowdf = exec(reducevalues(vcat, lazy_dfs[glob"*/*/yellow.csv"]))

first(yellowdf, 15)
15×18 DataFrame
│ Row │ VendorID │ tpep_pickup_datetime │ tpep_dropoff_datetime │ passenger_count │ trip_distance │ RatecodeID │ store_and_fwd_flag │ PULocationID │ DOLocationID │ payment_type │ fare_amount │ extra   │ mta_tax │ tip_amount │ tolls_amount │ improvement_surcharge │ total_amount │ congestion_surcharge │
│     │ Int64    │ String               │ String                │ Int64           │ Float64       │ Int64      │ String             │ Int64        │ Int64        │ Int64        │ Float64     │ Float64 │ Float64 │ Float64    │ Float64      │ Float64               │ Float64      │ Float64?             │
├─────┼──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┤
│ 1   │ 1        │ 2019-01-01 00:46:40  │ 2019-01-01 00:53:20   │ 1               │ 1.5           │ 1          │ N                  │ 151          │ 239          │ 1            │ 7.0         │ 0.5     │ 0.5     │ 1.65       │ 0.0          │ 0.3                   │ 9.95         │ missing              │
│ 2   │ 1        │ 2019-01-01 00:59:47  │ 2019-01-01 01:18:59   │ 1               │ 2.6           │ 1          │ N                  │ 239          │ 246          │ 1            │ 14.0        │ 0.5     │ 0.5     │ 1.0        │ 0.0          │ 0.3                   │ 16.3         │ missing              │
│ 3   │ 2        │ 2018-12-21 13:48:30  │ 2018-12-21 13:52:40   │ 3               │ 0.0           │ 1          │ N                  │ 236          │ 236          │ 1            │ 4.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 5.8          │ missing              │
│ 4   │ 2        │ 2018-11-28 15:52:25  │ 2018-11-28 15:55:45   │ 5               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 2            │ 3.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 7.55         │ missing              │
│ 5   │ 2        │ 2018-11-28 15:56:57  │ 2018-11-28 15:58:33   │ 5               │ 0.0           │ 2          │ N                  │ 193          │ 193          │ 2            │ 52.0        │ 0.0     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 55.55        │ missing              │
│ 6   │ 2        │ 2018-11-28 16:25:49  │ 2018-11-28 16:28:26   │ 5               │ 0.0           │ 1          │ N                  │ 193          │ 193          │ 2            │ 3.5         │ 0.5     │ 0.5     │ 0.0        │ 5.76         │ 0.3                   │ 13.31        │ missing              │
│ 7   │ 2        │ 2018-11-28 16:29:37  │ 2018-11-28 16:33:43   │ 5               │ 0.0           │ 2          │ N                  │ 193          │ 193          │ 2            │ 52.0        │ 0.0     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 55.55        │ missing              │
│ 8   │ 1        │ 2019-01-01 00:21:28  │ 2019-01-01 00:28:37   │ 1               │ 1.3           │ 1          │ N                  │ 163          │ 229          │ 1            │ 6.5         │ 0.5     │ 0.5     │ 1.25       │ 0.0          │ 0.3                   │ 9.05         │ missing              │
│ 9   │ 1        │ 2019-01-01 00:32:01  │ 2019-01-01 00:45:39   │ 1               │ 3.7           │ 1          │ N                  │ 229          │ 7            │ 1            │ 13.5        │ 0.5     │ 0.5     │ 3.7        │ 0.0          │ 0.3                   │ 18.5         │ missing              │
│ 10  │ 1        │ 2019-02-01 00:59:04  │ 2019-02-01 01:07:27   │ 1               │ 2.1           │ 1          │ N                  │ 48           │ 234          │ 1            │ 9.0         │ 0.5     │ 0.5     │ 2.0        │ 0.0          │ 0.3                   │ 12.3         │ 0.0                  │
│ 11  │ 1        │ 2019-02-01 00:33:09  │ 2019-02-01 01:03:58   │ 1               │ 9.8           │ 1          │ N                  │ 230          │ 93           │ 2            │ 32.0        │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 33.3         │ 0.0                  │
│ 12  │ 1        │ 2019-02-01 00:09:03  │ 2019-02-01 00:09:16   │ 1               │ 0.0           │ 1          │ N                  │ 145          │ 145          │ 2            │ 2.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 3.8          │ 0.0                  │
│ 13  │ 1        │ 2019-02-01 00:45:38  │ 2019-02-01 00:51:10   │ 1               │ 0.8           │ 1          │ N                  │ 95           │ 95           │ 2            │ 5.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 6.8          │ 0.0                  │
│ 14  │ 1        │ 2019-02-01 00:25:30  │ 2019-02-01 00:28:14   │ 1               │ 0.8           │ 1          │ N                  │ 140          │ 263          │ 2            │ 5.0         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 6.3          │ 0.0                  │
│ 15  │ 1        │ 2019-02-01 00:38:02  │ 2019-02-01 00:40:57   │ 1               │ 0.8           │ 1          │ N                  │ 229          │ 141          │ 2            │ 4.5         │ 0.5     │ 0.5     │ 0.0        │ 0.0          │ 0.3                   │ 5.8          │ 0.0                  │

Note that in the lazy case the green csv files are never loaded since they are not required to compute the final result!
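reducevalues is not tied to DataFrames; it combines whatever values the files hold. As a minimal sketch, assuming a made-up trips/ directory of text files, the example below attaches a line count to each file and sums the counts with +.

```julia
using FileTrees

# Hypothetical scratch directory, created only for this sketch.
root = mktempdir()
for (month, nlines) in (("01", 2), ("02", 3))
    mkpath(joinpath(root, "trips", month))
    write(joinpath(root, "trips", month, "yellow.txt"), "trip\n"^nlines)
end

total = cd(root) do
    # The value attached to each file is its number of lines.
    counts = FileTrees.load(FileTree("trips")) do file
        countlines(string(path(file)))
    end
    # Combine all per-file values into a single number.
    reducevalues(+, counts)
end

@show total
```

The reducing function should be one for which the grouping of operands does not matter, as is the case for + here and for vcat in the taxi example.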

Saving to a directory

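To write a tree to disk, restructure it as needed and then call save with a function that writes each file's value to its path:

df1 = dfs[glob"*/*/yellow.csv"]

# This mv moves X/Y/yellow.csv to yellow/X/Y.csv.
# See the Tree manipulation section of the docs for more.
df2 = mv(df1, r"^([^/]*)/([^/]*)/yellow\.csv$",
              s"yellow/\1/\2.csv")["yellow"]

@show df2

# Detach df2 from its parent so that paths start at yellow/ instead of taxi-data/yellow/.
FileTrees.save(setparent(df2, nothing)) do file
    CSV.write(path(file), get(file))
end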
df2 = yellow/
├─ 2019/
│  ├─ 01.csv (9×18 DataFrame)
│  └─ 02.csv (9×18 DataFrame)
└─ 2020/
   ├─ 01.csv (9×18 DataFrame)
   └─ 02.csv (9×18 DataFrame)

It's saved!

# let's read back the new directory
FileTree("yellow")
yellow/
├─ 2019/
│  ├─ 01.csv
│  └─ 02.csv
└─ 2020/
   ├─ 01.csv
   └─ 02.csv
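The same load, restructure and save round trip can be sketched without CSV or DataFrames. The notes/ directory, the jan.txt file and the archive/ target are assumptions made for this example; the callback passed to save simply writes each stored string to the file's path.

```julia
using FileTrees

# Hypothetical scratch directory, created only for this sketch.
root = mktempdir()
mkpath(joinpath(root, "notes", "2019"))
write(joinpath(root, "notes", "2019", "jan.txt"), "hello\n")

cd(root) do
    tree = FileTrees.load(FileTree("notes")) do file
        read(string(path(file)), String)
    end

    # Restructure in memory: 2019/jan.txt becomes archive/2019-jan.txt.
    archived = mv(tree, r"^([^/]*)/([^/]*)\.txt$", s"archive/\1-\2.txt")

    # Only now is anything written; the callback decides how each value is stored.
    FileTrees.save(archived) do file
        write(string(path(file)), get(file))
    end
end

saved = joinpath(root, "notes", "archive", "2019-jan.txt")
@show isfile(saved)
```

Note that save writes the files of the new tree but does not delete the originals, so notes/2019/jan.txt is still on disk afterwards.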

Happy Hacking!
