FileTree manipulation

The tree manipulation functions are map, filter, mv, cp, rm, merge, diff, clip, and mapsubtrees. They are often used in combination with other functions.

A lot of tree manipulation involves pattern matching, so we recommend you read the section on pattern matching first.

map and filter

map applies a function to every node in a file tree and returns a new file tree. The function should return a File or FileTree object.

filter keeps only the nodes that satisfy a given predicate function.

Both map and filter take a walk keyword argument, which can be either FileTrees.prewalk (pre-order traversal of the tree) or FileTrees.postwalk (post-order traversal). By default both operate on FileTree (subtree) nodes as well as File nodes. Pass dirs=false to work only on the File nodes.
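As a minimal sketch of the walk and dirs keywords, the snippet below builds a small tree with maketree, records the order in which map visits the nodes under each traversal, and then filters on file nodes only. The tree contents and the helper visit_order are made up for illustration, and FileTrees.name is assumed to return the name of a node.

```julia
using FileTrees

# Hypothetical tree: file names given as plain strings carry no values.
tree = maketree("data" => ["a.csv", "b.txt", "sub" => ["c.csv", "d.txt"]])

# Illustrative helper: record node names in the order `map` visits them.
function visit_order(tree; walk)
    visited = String[]
    map(tree; walk=walk) do node
        push!(visited, FileTrees.name(node))
        node  # return the node unchanged
    end
    return visited
end

pre  = visit_order(tree; walk=FileTrees.prewalk)   # parents before their children
post = visit_order(tree; walk=FileTrees.postwalk)  # children before their parents

# Apply the predicate to File nodes only; directory nodes are not tested.
csvs = filter(f -> endswith(FileTrees.name(f), ".csv"), tree; dirs=false)
```

With prewalk the root directory should be the first entry of pre, while with postwalk it should be the last entry of post.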

merge

merge(t1::FileTree, t2::FileTree; combine)

Merge two FileTrees. If files at the same path contain values, the combine callback will be called with their values to produce a new value.

If one of the dirs does not have a value, its corresponding argument will be NoValue(). If any of the values is lazy, the output value is lazy as well.

diff and rm

diff(t1::FileTree, t2::FileTree)

For each node in t2, remove the node in t1 at the same path, if it exists. Returns the difference tree.

rm(t::FileTree, pattern::Union{Glob, String, AbstractPath, Regex})

Remove nodes that match pattern from the file tree.
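The following is a small sketch of diff and rm on in-memory trees. The trees and their file names are hypothetical and built with maketree, so nothing is read from or deleted on disk. Both calls return new trees and leave t1 untouched.

```julia
using FileTrees

# Hypothetical trees that share the root name "dir".
t1 = maketree("dir" => ["a", "b", "sub" => ["c", "d"]])
t2 = maketree("dir" => ["b", "sub" => ["c"]])

# Nodes of t1 that do not also appear at the same path in t2.
remaining = diff(t1, t2)

# Remove everything directly under sub/ from t1 using a glob pattern.
pruned = rm(t1, glob"sub/*")
```

Displaying remaining and pruned at the REPL prints the resulting trees, which is the easiest way to check which paths were dropped.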

mv and cp

The signature of mv is mv(tree::FileTree, r::Regex, s::SubstitutionString; combine).

For every file in tree whose path matches the regular expression r, rewrite its path as decided by s. All paths are to be matched with delimiter / on all platforms (including Windows).

mv and cp not only allow you to move or copy nodes within a FileTree but also merge many files by copying them to the same path. combine is a callback that is called with the values of two files when a file is moved to an already existing or already created path. By default it is set to error on name clashes where either of the nodes has a non-null value.

s can be a SubstitutionString, which is conveniently constructed using the s"" string macro.

Within the string, sequences of the form \N refer to the Nth capture group in the regex, and \g<groupname> refers to a named capture group with name groupname.
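To see how the capture groups are spliced in, the sketch below applies the same kind of regex and substitution string to a plain path string using Base Julia's replace. It only illustrates the substitution syntax and does not call mv. The example path and the group names year and month are made up for illustration.

```julia
# Base Julia only: a regex => substitution-string pair applied to a path string.
filepath = "2019/01/yellow.csv"

# Numbered groups: \1 is the first capture, \2 the second.
numbered = replace(filepath,
    r"^([^/]*)/([^/]*)/yellow\.csv$" => s"yellow/\1/\2.csv")

# Named groups: \g<year> and \g<month> refer to (?<year>...) and (?<month>...).
named = replace(filepath,
    r"^(?<year>[^/]*)/(?<month>[^/]*)/yellow\.csv$" => s"yellow/\g<year>/\g<month>.csv")

println(numbered)  # yellow/2019/01.csv
println(named)     # yellow/2019/01.csv
```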

Example:

using FileTrees

tree = FileTree("taxi-data")
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  ├─ green.csv
│  │  └─ yellow.csv
│  └─ 02/
│     ├─ green.csv
│     └─ yellow.csv
└─ 2020/
   ├─ 01/
   │  ├─ green.csv
   │  └─ yellow.csv
   └─ 02/
      ├─ green.csv
      └─ yellow.csv
# first move */*/yellow.csv to yellow/*/*.csv

t2 = mv(tree, r"^([^/]*)/([^/]*)/yellow.csv$", s"yellow/\1/\2.csv")

# move */*/green.csv to green/*/*.csv
mv(t2, r"^([^/]*)/([^/]*)/green.csv$", s"green/\1/\2.csv")
taxi-data/
├─ green/
│  ├─ 2019/
│  │  ├─ 01.csv
│  │  └─ 02.csv
│  └─ 2020/
│     ├─ 01.csv
│     └─ 02.csv
└─ yellow/
   ├─ 2019/
   │  ├─ 01.csv
   │  └─ 02.csv
   └─ 2020/
      ├─ 01.csv
      └─ 02.csv

It's also possible to just move all the yellow files into a single yellow.csv file.

mv(tree, r"^([^/]*)/([^/]*)/yellow.csv$", s"yellow.csv")
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  └─ green.csv
│  └─ 02/
│     └─ green.csv
├─ 2020/
│  ├─ 01/
│  │  └─ green.csv
│  └─ 02/
│     └─ green.csv
└─ yellow.csv

This works when there is no value loaded into the tree, but it probably shouldn't. Let's see what happens when the yellow files have some values loaded in them:

using CSV, DataFrames
dfs = FileTrees.load(tree) do file
    DataFrame(CSV.File(path(file)))
end;
mv(dfs, r".*yellow.csv$", s"yellow.csv")
yellow.csv clashed with an existing file name at path taxi-data/./yellow.csv.
Pass `combine=f` to define how to combine them.

Oops! The error says to pass combine=f, where f is a function that combines the values of the two clashing files. In our case we want to concatenate the DataFrames, so let's pass in vcat.

mv(dfs, r".*yellow.csv$", s"yellow.csv", combine=vcat)
taxi-data/
├─ 2019/
│  ├─ 01/
│  │  └─ green.csv (9×20 DataFrame)
│  └─ 02/
│     └─ green.csv (9×20 DataFrame)
├─ 2020/
│  ├─ 01/
│  │  └─ green.csv (9×20 DataFrame)
│  └─ 02/
│     └─ green.csv (9×20 DataFrame)
└─ yellow.csv (36×18 DataFrame)

As you can see, the final yellow.csv file has a value that is a combination of all the yellow.csv values.

We can do the same with the green files:

df1 = mv(dfs, r".*yellow.csv$", s"yellow.csv", combine=vcat)
df2 = mv(df1, r".*green.csv$", s"green.csv", combine=vcat)
taxi-data/
├─ green.csv (36×20 DataFrame)
└─ yellow.csv (36×18 DataFrame)
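The examples above all use mv. As a hedged sketch, cp is assumed here to accept the same regex and substitution-string arguments, with the difference that the matched files also stay at their original paths. The tree below is a hypothetical stand-in for taxi-data built with maketree, and the backup/ destination is made up for illustration.

```julia
using FileTrees

# Hypothetical stand-in for the taxi-data directory; no files are read.
tree = maketree("taxi-data" => [
    "2019" => ["01" => ["green.csv", "yellow.csv"]],
    "2020" => ["01" => ["green.csv", "yellow.csv"]],
])

# Copy every yellow.csv to backup/<year>/<month>.csv, keeping the originals.
copied = cp(tree, r"^([^/]*)/([^/]*)/yellow\.csv$", s"backup/\1/\2.csv")
```

As with mv, a combine callback would be needed if several valued files were copied onto the same destination path.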

mapsubtrees

mapsubtrees(f, tree, pattern) lets you apply a function to every node of tree whose path matches pattern, which is either a Glob or a Regex (see also pattern matching).

f receives the matched subtree and may return a subtree to replace it. If f returns nothing, the node is deleted from the output tree. If f returns any other value, the subtree is emptied and the value of the node is set to the returned value.

This last behavior makes it analogous to Julia's mapslices, but on trees.

Suppose you have a nested tree of values, and you would like to join the data in the second level of the tree using vcat but the first level of the tree using hcat. This can be done in two stages: first use mapsubtrees to collapse the second level tree into a single value which is the vcat of all the values in each subtree. Then combine those results with an hcat.

To demonstrate this let's create a nested tree with a nice structure:

tree = maketree("dir"=>
                [string(i)=>[(name=string(j), value=(i,j)) for j in 1:5] for i=1:5])
dir/
├─ 1/
│  ├─ 1 ((1, 1))
│  ├─ 2 ((1, 2))
│  ├─ 3 ((1, 3))
│  ├─ 4 ((1, 4))
│  └─ 5 ((1, 5))
├─ 2/
│  ├─ 1 ((2, 1))
│  ├─ 2 ((2, 2))
│  ├─ 3 ((2, 3))
│  ├─ 4 ((2, 4))
│  └─ 5 ((2, 5))
├─ 3/
│  ├─ 1 ((3, 1))
│  ├─ 2 ((3, 2))
│  ├─ 3 ((3, 3))
│  ├─ 4 ((3, 4))
│  └─ 5 ((3, 5))
├─ 4/
│  ├─ 1 ((4, 1))
│  ├─ 2 ((4, 2))
│  ├─ 3 ((4, 3))
│  ├─ 4 ((4, 4))
│  └─ 5 ((4, 5))
└─ 5/
   ├─ 1 ((5, 1))
   ├─ 2 ((5, 2))
   ├─ 3 ((5, 3))
   ├─ 4 ((5, 4))
   └─ 5 ((5, 5))

Step 1: reduce everything from level 2 onwards:

vcated = mapsubtrees(tree, glob"*") do subtree
    reducevalues(vcat, subtree)
end
dir/
├─ 1/ (5-element Array{Tuple{Int64,Int64},1})
├─ 2/ (5-element Array{Tuple{Int64,Int64},1})
├─ 3/ (5-element Array{Tuple{Int64,Int64},1})
├─ 4/ (5-element Array{Tuple{Int64,Int64},1})
└─ 5/ (5-element Array{Tuple{Int64,Int64},1})

Step 2: reduce the intermediate results:

reducevalues(hcat, vcated, dirs=true)
5×5 Array{Tuple{Int64,Int64},2}:
 (1, 1)  (2, 1)  (3, 1)  (4, 1)  (5, 1)
 (1, 2)  (2, 2)  (3, 2)  (4, 2)  (5, 2)
 (1, 3)  (2, 3)  (3, 3)  (4, 3)  (5, 3)
 (1, 4)  (2, 4)  (3, 4)  (4, 4)  (5, 4)
 (1, 5)  (2, 5)  (3, 5)  (4, 5)  (5, 5)

This can also be done lazily!

vcated = mapsubtrees(tree, glob"*") do subtree
    reducevalues(vcat, subtree, lazy=true)
end
dir/
├─ 1/ (Thunk(vcat, (Thunk(vcat, ...), Thunk(vcat, ...))))
├─ 2/ (Thunk(vcat, (Thunk(vcat, ...), Thunk(vcat, ...))))
├─ 3/ (Thunk(vcat, (Thunk(vcat, ...), Thunk(vcat, ...))))
├─ 4/ (Thunk(vcat, (Thunk(vcat, ...), Thunk(vcat, ...))))
└─ 5/ (Thunk(vcat, (Thunk(vcat, ...), Thunk(vcat, ...))))
final = reducevalues(hcat, vcated, dirs=true)
Thunk(hcat, (Thunk(hcat, ...), Thunk(hcat, ...)))
exec(final)
5×5 Array{Tuple{Int64,Int64},2}:
 (1, 1)  (2, 1)  (3, 1)  (4, 1)  (5, 1)
 (1, 2)  (2, 2)  (3, 2)  (4, 2)  (5, 2)
 (1, 3)  (2, 3)  (3, 3)  (4, 3)  (5, 3)
 (1, 4)  (2, 4)  (3, 4)  (4, 4)  (5, 4)
 (1, 5)  (2, 5)  (3, 5)  (4, 5)  (5, 5)