Datasets with Arrow
Last updated
Last updated
enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:
Arrowβs standard format allows which removes virtually all serialization overhead.
Arrow is language-agnostic so it supports different programming languages.
Arrow is column-oriented so it is faster at querying and processing slices or columns of data.
Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.
Arrow supports many, possibly nested, column types.
π Datasets uses Arrow for its local caching system. It allows datasets to be backed by an on-disk cache, which is memory-mapped for fast lookup. This architecture allows for large datasets to be used on machines with relatively small device memory.
For example, loading the full English Wikipedia dataset only takes a few MB of RAM:
Copied
This is possible because the Arrow data is actually memory-mapped from disk, and not loaded in memory. Memory-mapping allows access to data on disk, and leverages virtual memory capabilities for fast lookups.
Iterating over a memory-mapped dataset using Arrow is fast. Iterating over Wikipedia on a laptop gives you speeds of 1-3 Gbit/s:
Copied