-
-
Notifications
You must be signed in to change notification settings - Fork 39
The `map` Function and disk.frame's usage of memory
In disk.frame each chunk of a data set is stored as fst files. The most common way of working with a file is to use the map function. The map function takes as arguments a disk.frame and a function, f. The idea is that the function f will be applied to each chunk of the disk.frame. By default, the application is lazy, meaning that the results are not returned immediately.
If run lazily (the default), the user has to call collect on the resultant "disk.frame". The disk.frame is in quotes because the user can choose not to return a data.frame in which case the returned data is a list. Of course, one may wish to write out the results as another disk.frame. In this case, we may b need to use Th writr_diskftame function.
Normally, parallel processing is performed by map. And the number of parallel processes defaults to the number of physical CPU cores (so hyperthreading on Intel CPUs do not double the number of processes. A process is also called a worker. So the amount of memories required should be sufficient to carry N simultaneous operations of f(CHUNK) where N is the number of workers.
Also, one has remember that each workrt is independent and do not have access to variable crated by another to worker. However, all workers have access to variables in the global environment that it can detect as being used. So when can a global variable be detected? Generally it can be detected if it was used in open code, but not so if it's used in a string e.g. glue::glue("bar {{vars1}}") is not going to work.