FileTrees is a set of tools to lazy-load, process and save file trees. Built-in parallelism allows you to max out all threads and processes that Julia is running with.
Files and subtrees in a file tree can have any value attached to them. You can map and reduce over these values, or combine them by merging or collapsing trees or subtrees. When computing lazy trees, these values are held in distributed memory and operated on in parallel.
Tree operations such as map, filter, mv, merge, and diff are immutable. Nothing is written to disk until save is called to save a tree, hence tree restructuring is cheap and fast.
You can install FileTrees with:
using Pkg
Pkg.add("FileTrees")
In this article
Below, we will see how to load a directory of files, do something to them, and then combine the results. This should help you get started!
Follow along
You can navigate to the page/ folder under the FileTrees package directory to try this out for yourself with the sample data there.
julia -e 'using Pkg; Pkg.develop("FileTrees")'
cd ~/.julia/dev/FileTrees/page/
Or you can try it with your own directory of data files!
The basic data structure in FileTrees is the FileTree.
Calling FileTree with a directory name will walk the directory on disk and construct a FileTree. Here we have a tiny sampling of data from the NYC Taxi dataset, for January and February of 2019 and 2020. Let's read the FileTree from this directory:
using FileTrees
taxi_dir = FileTree("taxi-data")
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ green.csv
│  │  └─ yellow.csv
│  └─ 02/
│     ├─ green.csv
│     └─ yellow.csv
└─ 2020/
   ├─ 01/
   │  ├─ green.csv
   │  └─ yellow.csv
   └─ 02/
      ├─ green.csv
      └─ yellow.csv
The files in the directory can be loaded using the load function. Here we will use CSV and DataFrames to load the CSV files.
using DataFrames, CSV
dfs = FileTrees.load(taxi_dir) do file
    DataFrame(CSV.File(path(file)))
end
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ green.csv (9×20 DataFrame)
│  │  └─ yellow.csv (9×18 DataFrame)
│  └─ 02/
│     ├─ green.csv (9×20 DataFrame)
│     └─ yellow.csv (9×18 DataFrame)
└─ 2020/
   ├─ 01/
   │  ├─ green.csv (9×20 DataFrame)
   │  └─ yellow.csv (9×18 DataFrame)
   └─ 02/
      ├─ green.csv (9×20 DataFrame)
      └─ yellow.csv (9×18 DataFrame)
A summary of the value loaded into each file is shown in parentheses. The file argument passed to the load callback is a File object. It supports the name and path functions, among others. path returns an AbstractPath which refers to the file's location.
load returns a new FileTree which has the same structure as before, but contains the loaded data in each File node.
Here load actually read the files eagerly. This may not be desirable if the data to be loaded are too big to fit in memory, or if you intend to use only a subtree of it rather than all of it.
In such a case, you can load the files lazily using lazy=true:
lazy_dfs = FileTrees.load(taxi_dir; lazy=true) do file
    DataFrame(CSV.File(path(file)))
end
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ green.csv (Thunk(#3, (File(taxi-data/2019/01/green.csv),)))
│  │  └─ yellow.csv (Thunk(#3, (File(taxi-data/2019/01/yellow.csv),)))
│  └─ 02/
│     ├─ green.csv (Thunk(#3, (File(taxi-data/2019/02/green.csv),)))
│     └─ yellow.csv (Thunk(#3, (File(taxi-data/2019/02/yellow.csv),)))
└─ 2020/
   ├─ 01/
   │  ├─ green.csv (Thunk(#3, (File(taxi-data/2020/01/green.csv),)))
   │  └─ yellow.csv (Thunk(#3, (File(taxi-data/2020/01/yellow.csv),)))
   └─ 02/
      ├─ green.csv (Thunk(#3, (File(taxi-data/2020/02/green.csv),)))
      └─ yellow.csv (Thunk(#3, (File(taxi-data/2020/02/yellow.csv),)))
As you can see, the nodes hold Thunk objects. A Thunk represents a lazy task that can later be executed using the exec function. You can continue to use most of the functions in this package without worrying about whether the input tree has lazy values or not. You will get the corresponding lazy outputs wherever the input trees had lazy values. Lazy values also encode the dependencies between them, which makes it possible for exec to compute them in parallel.
See this article to learn more about how to work with values. To know more details about the usage of laziness and parallelism, go to this article.
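To make the idea of a lazy value concrete, the following sketch mimics it with plain Julia closures. It is only an illustration under our own assumptions: fake_load and the file list are hypothetical, and real lazy trees hold Dagger Thunk objects that exec can schedule in parallel, which this stand-in does not attempt.

```julia
# Stand-in for lazy loading using plain closures; this is NOT Dagger's Thunk type.
loaded = String[]                    # records which files were actually "read"

# Hypothetical loader: pretends to read a file and returns a small payload.
function fake_load(name)
    push!(loaded, name)
    return length(name)
end

files = ["2019/01/green.csv", "2019/01/yellow.csv",
         "2020/01/green.csv", "2020/01/yellow.csv"]

# "Lazy load": wrap each call in a zero-argument closure instead of running it.
lazy_values = [f => (() -> fake_load(f)) for f in files]
@assert isempty(loaded)              # nothing has been read so far

# Keep only the yellow files, then force just those deferred computations.
yellow = [thunk for (f, thunk) in lazy_values if endswith(f, "yellow.csv")]
results = [thunk() for thunk in yellow]

println(loaded)                      # only the yellow files appear here
```

Because the work is deferred until a value is forced, anything that is filtered out beforehand is never computed.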
Let's look at one of these DataFrames by indexing into the tree with the path to a file, namely "2020/01/yellow.csv".
yellow_jan_20 = dfs["2020/01/yellow.csv"]
File(taxi-data/2020/01/yellow.csv)
get(file) fetches the value stored in a File or FileTree node:
get(yellow_jan_20)
9×18 DataFrame
│ Row │ VendorID │ tpep_pickup_datetime │ tpep_dropoff_datetime │ passenger_count │ trip_distance │ RatecodeID │ store_and_fwd_flag │ PULocationID │ DOLocationID │ payment_type │ fare_amount │ extra │ mta_tax │ tip_amount │ tolls_amount │ improvement_surcharge │ total_amount │ congestion_surcharge │
│ │ Int64 │ String │ String │ Int64 │ Float64 │ Int64 │ String │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │ Float64 │ Float64 │ Float64 │
├─────┼──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┤
│ 1 │ 1 │ 2020-01-01 00:28:15 │ 2020-01-01 00:33:03 │ 1 │ 1.2 │ 1 │ N │ 238 │ 239 │ 1 │ 6.0 │ 3.0 │ 0.5 │ 1.47 │ 0 │ 0.3 │ 11.27 │ 2.5 │
│ 2 │ 1 │ 2020-01-01 00:35:39 │ 2020-01-01 00:43:04 │ 1 │ 1.2 │ 1 │ N │ 239 │ 238 │ 1 │ 7.0 │ 3.0 │ 0.5 │ 1.5 │ 0 │ 0.3 │ 12.3 │ 2.5 │
│ 3 │ 1 │ 2020-01-01 00:47:41 │ 2020-01-01 00:53:52 │ 1 │ 0.6 │ 1 │ N │ 238 │ 238 │ 1 │ 6.0 │ 3.0 │ 0.5 │ 1.0 │ 0 │ 0.3 │ 10.8 │ 2.5 │
│ 4 │ 1 │ 2020-01-01 00:55:23 │ 2020-01-01 01:00:14 │ 1 │ 0.8 │ 1 │ N │ 238 │ 151 │ 1 │ 5.5 │ 0.5 │ 0.5 │ 1.36 │ 0 │ 0.3 │ 8.16 │ 0.0 │
│ 5 │ 2 │ 2020-01-01 00:01:58 │ 2020-01-01 00:04:16 │ 1 │ 0.0 │ 1 │ N │ 193 │ 193 │ 2 │ 3.5 │ 0.5 │ 0.5 │ 0.0 │ 0 │ 0.3 │ 4.8 │ 0.0 │
│ 6 │ 2 │ 2020-01-01 00:09:44 │ 2020-01-01 00:10:37 │ 1 │ 0.03 │ 1 │ N │ 7 │ 193 │ 2 │ 2.5 │ 0.5 │ 0.5 │ 0.0 │ 0 │ 0.3 │ 3.8 │ 0.0 │
│ 7 │ 2 │ 2020-01-01 00:39:25 │ 2020-01-01 00:39:29 │ 1 │ 0.0 │ 1 │ N │ 193 │ 193 │ 1 │ 2.5 │ 0.5 │ 0.5 │ 0.01 │ 0 │ 0.3 │ 3.81 │ 0.0 │
│ 8 │ 2 │ 2019-12-18 15:27:49 │ 2019-12-18 15:28:59 │ 1 │ 0.0 │ 5 │ N │ 193 │ 193 │ 1 │ 0.01 │ 0.0 │ 0.0 │ 0.0 │ 0 │ 0.3 │ 2.81 │ 2.5 │
│ 9 │ 2 │ 2019-12-18 15:30:35 │ 2019-12-18 15:31:35 │ 4 │ 0.0 │ 1 │ N │ 193 │ 193 │ 1 │ 2.5 │ 0.5 │ 0.5 │ 0.0 │ 0 │ 0.3 │ 6.3 │ 2.5 │
When a tree is lazy, the get operation returns a Thunk, a delayed computation.
You can call exec on this value to compute and fetch the result.
val = get(lazy_dfs["2020/01/yellow.csv"])
@show typeof(val)
@show exec(val);
typeof(val) = Dagger.Thunk
exec(val) = 9×18 DataFrame
│ Row │ VendorID │ tpep_pickup_datetime │ tpep_dropoff_datetime │ passenger_count │ trip_distance │ RatecodeID │ store_and_fwd_flag │ PULocationID │ DOLocationID │ payment_type │ fare_amount │ extra │ mta_tax │ tip_amount │ tolls_amount │ improvement_surcharge │ total_amount │ congestion_surcharge │
│ │ Int64 │ String │ String │ Int64 │ Float64 │ Int64 │ String │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │ Float64 │ Float64 │ Float64 │
├─────┼──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┤
│ 1 │ 1 │ 2020-01-01 00:28:15 │ 2020-01-01 00:33:03 │ 1 │ 1.2 │ 1 │ N │ 238 │ 239 │ 1 │ 6.0 │ 3.0 │ 0.5 │ 1.47 │ 0 │ 0.3 │ 11.27 │ 2.5 │
│ 2 │ 1 │ 2020-01-01 00:35:39 │ 2020-01-01 00:43:04 │ 1 │ 1.2 │ 1 │ N │ 239 │ 238 │ 1 │ 7.0 │ 3.0 │ 0.5 │ 1.5 │ 0 │ 0.3 │ 12.3 │ 2.5 │
│ 3 │ 1 │ 2020-01-01 00:47:41 │ 2020-01-01 00:53:52 │ 1 │ 0.6 │ 1 │ N │ 238 │ 238 │ 1 │ 6.0 │ 3.0 │ 0.5 │ 1.0 │ 0 │ 0.3 │ 10.8 │ 2.5 │
│ 4 │ 1 │ 2020-01-01 00:55:23 │ 2020-01-01 01:00:14 │ 1 │ 0.8 │ 1 │ N │ 238 │ 151 │ 1 │ 5.5 │ 0.5 │ 0.5 │ 1.36 │ 0 │ 0.3 │ 8.16 │ 0.0 │
│ 5 │ 2 │ 2020-01-01 00:01:58 │ 2020-01-01 00:04:16 │ 1 │ 0.0 │ 1 │ N │ 193 │ 193 │ 2 │ 3.5 │ 0.5 │ 0.5 │ 0.0 │ 0 │ 0.3 │ 4.8 │ 0.0 │
│ 6 │ 2 │ 2020-01-01 00:09:44 │ 2020-01-01 00:10:37 │ 1 │ 0.03 │ 1 │ N │ 7 │ 193 │ 2 │ 2.5 │ 0.5 │ 0.5 │ 0.0 │ 0 │ 0.3 │ 3.8 │ 0.0 │
│ 7 │ 2 │ 2020-01-01 00:39:25 │ 2020-01-01 00:39:29 │ 1 │ 0.0 │ 1 │ N │ 193 │ 193 │ 1 │ 2.5 │ 0.5 │ 0.5 │ 0.01 │ 0 │ 0.3 │ 3.81 │ 0.0 │
│ 8 │ 2 │ 2019-12-18 15:27:49 │ 2019-12-18 15:28:59 │ 1 │ 0.0 │ 5 │ N │ 193 │ 193 │ 1 │ 0.01 │ 0.0 │ 0.0 │ 0.0 │ 0 │ 0.3 │ 2.81 │ 2.5 │
│ 9 │ 2 │ 2019-12-18 15:30:35 │ 2019-12-18 15:31:35 │ 4 │ 0.0 │ 1 │ N │ 193 │ 193 │ 1 │ 2.5 │ 0.5 │ 0.5 │ 0.0 │ 0 │ 0.3 │ 6.3 │ 2.5 │
Yellow and green taxi data have different sets of columns. It may be convenient to separate them out into two trees:
yellow = dfs[glob"*/*/yellow.csv"]
green = dfs[glob"*/*/green.csv"];
[yellow green]
1×2 Array{FileTrees.FileTree,2}:
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  └─ yellow.csv (9×18 DataFrame)
│  └─ 02/
│     └─ yellow.csv (9×18 DataFrame)
└─ 2020/
   ├─ 01/
   │  └─ yellow.csv (9×18 DataFrame)
   └─ 02/
      └─ yellow.csv (9×18 DataFrame)

taxi-data/
├─ 2019/
│  ├─ 01/
│  │  └─ green.csv (9×20 DataFrame)
│  └─ 02/
│     └─ green.csv (9×20 DataFrame)
└─ 2020/
   ├─ 01/
   │  └─ green.csv (9×20 DataFrame)
   └─ 02/
      └─ green.csv (9×20 DataFrame)
Here we used a glob expression constructed with the glob"" string macro. This macro is provided by Glob.jl and is re-exported by FileTrees.
See the pattern matching documentation to learn more about how to use pattern matching to manipulate trees.
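As a rough guide to what a pattern like */*/yellow.csv selects, here is a sketch that expresses the same idea as a regular expression in plain Julia. It assumes that each * matches characters within a single path segment and never crosses a "/"; the path list is made up for illustration, and the regex is a stand-in rather than how Glob.jl is implemented.

```julia
# Regex stand-in for the glob pattern */*/yellow.csv, assuming `*` matches
# any run of characters within one path segment (it never crosses a "/").
pattern = r"^[^/]*/[^/]*/yellow\.csv$"

# Hypothetical relative paths, including one that is only one level deep.
paths = ["2019/01/green.csv", "2019/01/yellow.csv",
         "2020/02/yellow.csv", "2020/yellow.csv"]

# Keep the paths with exactly two leading segments followed by yellow.csv.
selected = filter(p -> occursin(pattern, p), paths)
println(selected)
```

The green file and the shallower 2020/yellow.csv are excluded because they do not match all three segments of the pattern.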
Now that we have files with the same schema in different trees, we can reduce either tree with the vcat function to combine its DataFrames into a single DataFrame:
yellowdf = reducevalues(vcat, yellow)
first(yellowdf, 15)
15×18 DataFrame
│ Row │ VendorID │ tpep_pickup_datetime │ tpep_dropoff_datetime │ passenger_count │ trip_distance │ RatecodeID │ store_and_fwd_flag │ PULocationID │ DOLocationID │ payment_type │ fare_amount │ extra │ mta_tax │ tip_amount │ tolls_amount │ improvement_surcharge │ total_amount │ congestion_surcharge │
│ │ Int64 │ String │ String │ Int64 │ Float64 │ Int64 │ String │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64? │
├─────┼──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┤
│ 1 │ 1 │ 2019-01-01 00:46:40 │ 2019-01-01 00:53:20 │ 1 │ 1.5 │ 1 │ N │ 151 │ 239 │ 1 │ 7.0 │ 0.5 │ 0.5 │ 1.65 │ 0.0 │ 0.3 │ 9.95 │ missing │
│ 2 │ 1 │ 2019-01-01 00:59:47 │ 2019-01-01 01:18:59 │ 1 │ 2.6 │ 1 │ N │ 239 │ 246 │ 1 │ 14.0 │ 0.5 │ 0.5 │ 1.0 │ 0.0 │ 0.3 │ 16.3 │ missing │
│ 3 │ 2 │ 2018-12-21 13:48:30 │ 2018-12-21 13:52:40 │ 3 │ 0.0 │ 1 │ N │ 236 │ 236 │ 1 │ 4.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 5.8 │ missing │
│ 4 │ 2 │ 2018-11-28 15:52:25 │ 2018-11-28 15:55:45 │ 5 │ 0.0 │ 1 │ N │ 193 │ 193 │ 2 │ 3.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 7.55 │ missing │
│ 5 │ 2 │ 2018-11-28 15:56:57 │ 2018-11-28 15:58:33 │ 5 │ 0.0 │ 2 │ N │ 193 │ 193 │ 2 │ 52.0 │ 0.0 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 55.55 │ missing │
│ 6 │ 2 │ 2018-11-28 16:25:49 │ 2018-11-28 16:28:26 │ 5 │ 0.0 │ 1 │ N │ 193 │ 193 │ 2 │ 3.5 │ 0.5 │ 0.5 │ 0.0 │ 5.76 │ 0.3 │ 13.31 │ missing │
│ 7 │ 2 │ 2018-11-28 16:29:37 │ 2018-11-28 16:33:43 │ 5 │ 0.0 │ 2 │ N │ 193 │ 193 │ 2 │ 52.0 │ 0.0 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 55.55 │ missing │
│ 8 │ 1 │ 2019-01-01 00:21:28 │ 2019-01-01 00:28:37 │ 1 │ 1.3 │ 1 │ N │ 163 │ 229 │ 1 │ 6.5 │ 0.5 │ 0.5 │ 1.25 │ 0.0 │ 0.3 │ 9.05 │ missing │
│ 9 │ 1 │ 2019-01-01 00:32:01 │ 2019-01-01 00:45:39 │ 1 │ 3.7 │ 1 │ N │ 229 │ 7 │ 1 │ 13.5 │ 0.5 │ 0.5 │ 3.7 │ 0.0 │ 0.3 │ 18.5 │ missing │
│ 10 │ 1 │ 2019-02-01 00:59:04 │ 2019-02-01 01:07:27 │ 1 │ 2.1 │ 1 │ N │ 48 │ 234 │ 1 │ 9.0 │ 0.5 │ 0.5 │ 2.0 │ 0.0 │ 0.3 │ 12.3 │ 0.0 │
│ 11 │ 1 │ 2019-02-01 00:33:09 │ 2019-02-01 01:03:58 │ 1 │ 9.8 │ 1 │ N │ 230 │ 93 │ 2 │ 32.0 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 33.3 │ 0.0 │
│ 12 │ 1 │ 2019-02-01 00:09:03 │ 2019-02-01 00:09:16 │ 1 │ 0.0 │ 1 │ N │ 145 │ 145 │ 2 │ 2.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 3.8 │ 0.0 │
│ 13 │ 1 │ 2019-02-01 00:45:38 │ 2019-02-01 00:51:10 │ 1 │ 0.8 │ 1 │ N │ 95 │ 95 │ 2 │ 5.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 6.8 │ 0.0 │
│ 14 │ 1 │ 2019-02-01 00:25:30 │ 2019-02-01 00:28:14 │ 1 │ 0.8 │ 1 │ N │ 140 │ 263 │ 2 │ 5.0 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 6.3 │ 0.0 │
│ 15 │ 1 │ 2019-02-01 00:38:02 │ 2019-02-01 00:40:57 │ 1 │ 0.8 │ 1 │ N │ 229 │ 141 │ 2 │ 4.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 5.8 │ 0.0 │
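Conceptually, reducing with vcat stacks the per-file values one after another. The sketch below shows the same fold with Base's reduce on ordinary vectors; the numbers are hypothetical stand-ins for the rows of each file, and the order in which a tree visits its files is not something this example demonstrates.

```julia
# Hypothetical per-file values: each vector stands in for one file's rows.
values_per_file = [[1, 2, 3], [4, 5], [6]]

# reduce combines the values pairwise with vcat, yielding one long vector.
combined = reduce(vcat, values_per_file)
println(combined)
```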
reducevalues also works on the lazy tree but returns a lazy final result. You can call exec on it to actually compute it. This causes the computation to occur in parallel!
yellowdf = exec(reducevalues(vcat, lazy_dfs[glob"*/*/yellow.csv"]))
first(yellowdf, 15)
15×18 DataFrame
│ Row │ VendorID │ tpep_pickup_datetime │ tpep_dropoff_datetime │ passenger_count │ trip_distance │ RatecodeID │ store_and_fwd_flag │ PULocationID │ DOLocationID │ payment_type │ fare_amount │ extra │ mta_tax │ tip_amount │ tolls_amount │ improvement_surcharge │ total_amount │ congestion_surcharge │
│ │ Int64 │ String │ String │ Int64 │ Float64 │ Int64 │ String │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64? │
├─────┼──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┤
│ 1 │ 1 │ 2019-01-01 00:46:40 │ 2019-01-01 00:53:20 │ 1 │ 1.5 │ 1 │ N │ 151 │ 239 │ 1 │ 7.0 │ 0.5 │ 0.5 │ 1.65 │ 0.0 │ 0.3 │ 9.95 │ missing │
│ 2 │ 1 │ 2019-01-01 00:59:47 │ 2019-01-01 01:18:59 │ 1 │ 2.6 │ 1 │ N │ 239 │ 246 │ 1 │ 14.0 │ 0.5 │ 0.5 │ 1.0 │ 0.0 │ 0.3 │ 16.3 │ missing │
│ 3 │ 2 │ 2018-12-21 13:48:30 │ 2018-12-21 13:52:40 │ 3 │ 0.0 │ 1 │ N │ 236 │ 236 │ 1 │ 4.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 5.8 │ missing │
│ 4 │ 2 │ 2018-11-28 15:52:25 │ 2018-11-28 15:55:45 │ 5 │ 0.0 │ 1 │ N │ 193 │ 193 │ 2 │ 3.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 7.55 │ missing │
│ 5 │ 2 │ 2018-11-28 15:56:57 │ 2018-11-28 15:58:33 │ 5 │ 0.0 │ 2 │ N │ 193 │ 193 │ 2 │ 52.0 │ 0.0 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 55.55 │ missing │
│ 6 │ 2 │ 2018-11-28 16:25:49 │ 2018-11-28 16:28:26 │ 5 │ 0.0 │ 1 │ N │ 193 │ 193 │ 2 │ 3.5 │ 0.5 │ 0.5 │ 0.0 │ 5.76 │ 0.3 │ 13.31 │ missing │
│ 7 │ 2 │ 2018-11-28 16:29:37 │ 2018-11-28 16:33:43 │ 5 │ 0.0 │ 2 │ N │ 193 │ 193 │ 2 │ 52.0 │ 0.0 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 55.55 │ missing │
│ 8 │ 1 │ 2019-01-01 00:21:28 │ 2019-01-01 00:28:37 │ 1 │ 1.3 │ 1 │ N │ 163 │ 229 │ 1 │ 6.5 │ 0.5 │ 0.5 │ 1.25 │ 0.0 │ 0.3 │ 9.05 │ missing │
│ 9 │ 1 │ 2019-01-01 00:32:01 │ 2019-01-01 00:45:39 │ 1 │ 3.7 │ 1 │ N │ 229 │ 7 │ 1 │ 13.5 │ 0.5 │ 0.5 │ 3.7 │ 0.0 │ 0.3 │ 18.5 │ missing │
│ 10 │ 1 │ 2019-02-01 00:59:04 │ 2019-02-01 01:07:27 │ 1 │ 2.1 │ 1 │ N │ 48 │ 234 │ 1 │ 9.0 │ 0.5 │ 0.5 │ 2.0 │ 0.0 │ 0.3 │ 12.3 │ 0.0 │
│ 11 │ 1 │ 2019-02-01 00:33:09 │ 2019-02-01 01:03:58 │ 1 │ 9.8 │ 1 │ N │ 230 │ 93 │ 2 │ 32.0 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 33.3 │ 0.0 │
│ 12 │ 1 │ 2019-02-01 00:09:03 │ 2019-02-01 00:09:16 │ 1 │ 0.0 │ 1 │ N │ 145 │ 145 │ 2 │ 2.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 3.8 │ 0.0 │
│ 13 │ 1 │ 2019-02-01 00:45:38 │ 2019-02-01 00:51:10 │ 1 │ 0.8 │ 1 │ N │ 95 │ 95 │ 2 │ 5.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 6.8 │ 0.0 │
│ 14 │ 1 │ 2019-02-01 00:25:30 │ 2019-02-01 00:28:14 │ 1 │ 0.8 │ 1 │ N │ 140 │ 263 │ 2 │ 5.0 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 6.3 │ 0.0 │
│ 15 │ 1 │ 2019-02-01 00:38:02 │ 2019-02-01 00:40:57 │ 1 │ 0.8 │ 1 │ N │ 229 │ 141 │ 2 │ 4.5 │ 0.5 │ 0.5 │ 0.0 │ 0.0 │ 0.3 │ 5.8 │ 0.0 │
Note that in the lazy case the green csv files are never loaded since they are not required to compute the final result!
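Finally, a tree can be restructured and written back to disk with save. Here we move the yellow files into a new layout and write each DataFrame out as a CSV file:
df1 = dfs[glob"*/*/yellow.csv"]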
# this mv moves X/Y/yellow.csv to yellow/X/Y.csv
# see the Tree manipulation section of the docs for more
df2 = mv(df1, r"^([^/]*)/([^/]*)/yellow.csv$",
         s"yellow/\1/\2.csv")["yellow"]
@show df2
FileTrees.save(setparent(df2, nothing)) do file
    CSV.write(path(file), get(file))
end
df2 = yellow/
├─ 2019/
│  ├─ 01.csv (9×18 DataFrame)
│  └─ 02.csv (9×18 DataFrame)
└─ 2020/
   ├─ 01.csv (9×18 DataFrame)
   └─ 02.csv (9×18 DataFrame)
It's saved!
# let's read back the new directory
FileTree("yellow")
yellow/
├─ 2019/
│  ├─ 01.csv
│  └─ 02.csv
└─ 2020/
   ├─ 01.csv
   └─ 02.csv
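The regex and substitution string passed to mv earlier can be tried out on plain strings with Base's replace. This sketch assumes that mv applies the pattern to each file's path relative to the tree root, with "/" as the separator; the path list is written out by hand for illustration.

```julia
# The same regex and substitution string used in the mv call.
from = r"^([^/]*)/([^/]*)/yellow.csv$"
to = s"yellow/\1/\2.csv"

# Hand-written relative paths matching the layout of the yellow tree.
paths = ["2019/01/yellow.csv", "2019/02/yellow.csv",
         "2020/01/yellow.csv", "2020/02/yellow.csv"]

# \1 and \2 refer to the first and second captured path segments.
new_paths = [replace(p, from => to) for p in paths]
foreach(println, new_paths)
```

Each X/Y/yellow.csv becomes yellow/X/Y.csv, which matches the structure of the tree that was saved and read back.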
Happy Hacking!