Splitting a file into many files

Inside map or mapsubtrees, return maketree("." => fs), where fs is a vector of files, to "splat" several files into the tree. The files in fs are placed in the same directory as the file they replace.

A convenience function will come in handy:

splatfiles(fs) = maketree("." => fs)

splatfiles (generic function with 1 method)
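
To see what splatfiles builds on its own, here is a minimal sketch outside of any map or mapsubtrees call. The file names and string values are made up for illustration; only maketree, files, name and get come from FileTrees.

```julia
using FileTrees

splatfiles(fs) = maketree("." => fs)

# Hypothetical entries: each NamedTuple gives a file name and the value it holds.
parts = [(name="part-1.txt", value="first"),
         (name="part-2.txt", value="second")]

t = splatfiles(parts)

# Walk the resulting tree and show each file with its stored value.
for f in files(t)
    println(name(f), " => ", get(f))
end
```

When a tree like this is returned from mapsubtrees, its files take the place of the node that was matched.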

using FileTrees
using DataFrames, CSV

taxi_dir = FileTree("taxi-data")

dfs = FileTrees.load(taxi_dir) do file
    DataFrame(CSV.File(path(file)))
end

taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ green.csv (9×20 DataFrame)
│  │  └─ yellow.csv (9×18 DataFrame)
│  └─ 02/
│     ├─ green.csv (9×20 DataFrame)
│     └─ yellow.csv (9×18 DataFrame)
└─ 2020/
   ├─ 01/
   │  ├─ green.csv (9×20 DataFrame)
   │  └─ yellow.csv (9×18 DataFrame)
   └─ 02/
      ├─ green.csv (9×20 DataFrame)
      └─ yellow.csv (9×18 DataFrame)

Split each yellow file into one file per RatecodeID:

yellowdfs = dfs[r"yellow\.csv$"]

expanded_tree = mapsubtrees(yellowdfs, glob"*/*/yellow.csv") do df
    map(groupby(get(df), :RatecodeID) |> collect) do group
        (name=string("yellow-ratecode-", group.RatecodeID[1], ".df"), value=DataFrame(group))
    end |> splatfiles
end

taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ yellow-ratecode-1.df (7×18 DataFrame)
│  │  └─ yellow-ratecode-2.df (2×18 DataFrame)
│  └─ 02/
│     └─ yellow-ratecode-1.df (9×18 DataFrame)
└─ 2020/
   ├─ 01/
   │  ├─ yellow-ratecode-1.df (8×18 DataFrame)
   │  └─ yellow-ratecode-5.df (1×18 DataFrame)
   └─ 02/
      └─ yellow-ratecode-1.df (9×18 DataFrame)
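
The grouping step inside the do block can be tried in isolation. The sketch below uses a small made-up table instead of the taxi data and builds the same kind of (name, value) entries that are handed to splatfiles; the fare column and its values are hypothetical.

```julia
using DataFrames

# Made-up stand-in for one yellow.csv table.
df = DataFrame(RatecodeID=[1, 1, 2], fare=[7.0, 14.0, 52.0])

# One entry per group: the name encodes the group key,
# the value is a standalone copy of the group's rows.
parts = map(collect(groupby(df, :RatecodeID))) do group
    (name=string("yellow-ratecode-", group.RatecodeID[1], ".df"),
     value=DataFrame(group))
end

for p in parts
    println(p.name, ": ", nrow(p.value), " rows")
end
```

Calling DataFrame(group) copies the rows out of the SubDataFrame, so each new file holds an independent table.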

You can write these files to disk with FileTrees.save if you wish.
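
As a rough sketch of saving, the example below writes a small made-up tree into a temporary directory. It assumes that FileTrees.save calls the given function once for every file that has a value and creates the parent directories first; the tree name "out" and the file contents are hypothetical.

```julia
using FileTrees

# Hypothetical tree with in-memory string values.
t = maketree("out" => [(name="a.txt", value="hello"),
                       (name="b.txt", value="world")])

outdir = mktempdir()
written = cd(outdir) do
    # The callback decides how each value is written to its path.
    FileTrees.save(t) do file
        write(path(file), get(file))
    end
    # Read one file back to check the round trip.
    read(joinpath("out", "a.txt"), String)
end
println(written)
```

For the DataFrame values above, the callback could use CSV.write(path(file), get(file)) instead. Note that the paths are relative to the tree's root name, so saving expanded_tree unchanged would write next to the original files in taxi-data.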

How to create lazy subtrees?

If the value field of a file passed to splatfiles is a Thunk, the file holds a lazy value that is computed only when exec is called on the tree.

A Thunk can be created with the syntax lazy(f)(x...). The result is a Thunk that represents the deferred call f(x...).
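
A minimal sketch of the lazy syntax on its own, using plain numbers rather than files. It assumes that exec, when given a single Thunk rather than a tree, computes and returns its value.

```julia
using FileTrees

# `lazy(f)(x...)` records the call without running it.
thunk = lazy(+)(1, 2)

# `exec` forces the computation of `+(1, 2)`.
result = exec(thunk)
println(result)
```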

yellowdfs = dfs[r"yellow\.csv$"]

expanded_tree = mapsubtrees(yellowdfs, glob"*/*/yellow.csv") do df
    map(groupby(get(df), :payment_type) |> collect) do group
        id = group.payment_type[1]
        (name=string("yellow-ptype-", id, ".df"), value=lazy(repr)(group))
    end |> splatfiles
end

taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ yellow-ptype-1.df (FileTrees.Thunk)
│  │  └─ yellow-ptype-2.df (FileTrees.Thunk)
│  └─ 02/
│     ├─ yellow-ptype-1.df (FileTrees.Thunk)
│     └─ yellow-ptype-2.df (FileTrees.Thunk)
└─ 2020/
   ├─ 01/
   │  ├─ yellow-ptype-1.df (FileTrees.Thunk)
   │  └─ yellow-ptype-2.df (FileTrees.Thunk)
   └─ 02/
      ├─ yellow-ptype-1.df (FileTrees.Thunk)
      └─ yellow-ptype-2.df (FileTrees.Thunk)

exec(expanded_tree)

taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ yellow-ptype-1.df (2850-codeunit String)
│  │  └─ yellow-ptype-2.df (2565-codeunit String)
│  └─ 02/
│     ├─ yellow-ptype-1.df (1708-codeunit String)
│     └─ yellow-ptype-2.df (3703-codeunit String)
└─ 2020/
   ├─ 01/
   │  ├─ yellow-ptype-1.df (3420-codeunit String)
   │  └─ yellow-ptype-2.df (1995-codeunit String)
   └─ 02/
      ├─ yellow-ptype-1.df (3420-codeunit String)
      └─ yellow-ptype-2.df (1995-codeunit String)

exec(expanded_tree) |> files |> first |> get |> print

5×18 SubDataFrame
 Row │ VendorID  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance  RatecodeID  store_and_fwd_flag  PULocationID  DOLocationID  payment_type  fare_amount  extra    mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge
     │ Int64     String31              String31               Int64            Float64        Int64       String1             Int64         Int64         Int64         Float64      Float64  Float64  Float64     Float64       Float64                Float64       Missing
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        1  2019-01-01 00:46:40   2019-01-01 00:53:20                  1            1.5           1  N                            151           239             1          7.0      0.5      0.5        1.65           0.0                    0.3          9.95               missing
   2 │        1  2019-01-01 00:59:47   2019-01-01 01:18:59                  1            2.6           1  N                            239           246             1         14.0      0.5      0.5        1.0            0.0                    0.3         16.3                missing
   3 │        2  2018-12-21 13:48:30   2018-12-21 13:52:40                  3            0.0           1  N                            236           236             1          4.5      0.5      0.5        0.0            0.0                    0.3          5.8                missing
   4 │        1  2019-01-01 00:21:28   2019-01-01 00:28:37                  1            1.3           1  N                            163           229             1          6.5      0.5      0.5        1.25           0.0                    0.3          9.05               missing
   5 │        1  2019-01-01 00:32:01   2019-01-01 00:45:39                  1            3.7           1  N                            229             7             1         13.5      0.5      0.5        3.7            0.0                    0.3         18.5                missing
