Google AI introduces “TensorStore,” an open-source C++ and Python library designed for reading and writing large multidimensional arrays

Many modern computer science and machine learning applications use multi-dimensional datasets that span a single large coordinate system. Two examples are weather prediction from air measurements over a spatial grid, and medical imaging predictions based on multi-channel image intensity values from a 2D or 3D scan. Such datasets can be difficult to work with because users may receive and write data at irregular intervals and at varying scales, and they often want to run analyses on multiple workstations in parallel. Under these circumstances, even a single dataset may require petabytes of storage.

Google's TensorStore addresses fundamental engineering challenges in scientific computing related to the management and processing of huge datasets, such as those arising in neuroscience. TensorStore is an open-source C++ and Python software library developed by Google Research for storing and manipulating n-dimensional data. The library supports multiple storage systems, such as Google Cloud Storage, local and network file systems, and more, and offers a unified API for reading and writing many array formats. It also provides caching and read/write transactions with strong atomicity, consistency, isolation, and durability (ACID) guarantees, while optimistic concurrency ensures safe access from multiple processes and machines.
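
As a rough sketch of how this looks in practice with the Python API (the bucket and array path below are placeholders, not a real dataset), a group of writes can be staged in a transaction and committed atomically:

```python
import tensorstore as ts

# Open a zarr-format array stored on Google Cloud Storage; the bucket
# and path are placeholders. The same API works for local files, etc.
dataset = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'gcs', 'bucket': 'example-bucket',
                'path': 'my_array/'},
}).result()

# Stage several writes in a single transaction; they are committed
# atomically, and conflicting concurrent writes are detected.
txn = ts.Transaction()
dataset.with_transaction(txn)[100:110, 200:210].write(1).result()
dataset.with_transaction(txn)[500:510, 0:10].write(2).result()
txn.commit_async().result()
```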

TensorStore offers a simple Python API for loading and working with massive arrays of data. Arbitrarily large underlying datasets can be loaded and manipulated without holding the entire dataset in memory, since no actual data is read or retained until a specific slice is requested. This is possible through an indexing and manipulation syntax that is largely the same as that used for NumPy operations. TensorStore also supports advanced indexing features such as transformations, alignment, broadcasting, and virtual views (data type conversion, downsampling, and lazily generated on-the-fly arrays).
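
A minimal sketch of this lazy behavior, assuming an existing zarr array on local disk (the path is a placeholder):

```python
import tensorstore as ts

# Open an existing zarr array on the local filesystem (placeholder path).
dataset = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'file', 'path': '/tmp/my_array/'},
}).result()

# Indexing only builds a lazy view; no data is fetched yet.
view = dataset[100:200, 50:100]

# Bytes are read from storage only when explicitly requested,
# producing an ordinary NumPy array.
block = view.read().result()
print(block.shape, block.dtype)

# Virtual views are also lazy, e.g. an on-the-fly dtype conversion:
as_float = ts.cast(dataset, ts.float32)
```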

Large numerical datasets require significant processing power to analyze, usually achieved by parallelizing work across many CPU or accelerator cores spread over multiple machines. A core goal of TensorStore has therefore been to allow parallel processing of individual datasets with both high performance (so that reading and writing to TensorStore does not become a bottleneck during computation) and safety (so that concurrent access patterns cannot cause corruption or inconsistencies). TensorStore provides an asynchronous API that lets a read or write operation continue in the background while the program performs other work, along with customizable in-memory caching, which reduces interactions with slower storage systems for frequently accessed data. Optimistic concurrency keeps parallel operations safe when many machines access the same dataset, and it maintains compatibility with diverse underlying storage layers without significantly affecting performance. TensorStore has also been integrated with parallel computing frameworks such as Apache Beam and Dask, making distributed computing with TensorStore compatible with many existing data processing workflows.
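
For example, a read can be issued asynchronously and only waited on when the data is actually needed (the path and cache size below are illustrative):

```python
import tensorstore as ts

# Open with an in-memory cache pool (~100 MB) so repeated reads of hot
# data avoid the slower underlying storage; sizes are illustrative.
dataset = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'file', 'path': '/tmp/my_array/'},
    'context': {'cache_pool': {'total_bytes_limit': 100_000_000}},
}).result()

# Start an asynchronous read; it proceeds in the background.
future = dataset[0:64, 0:64].read()

# ... the program can do unrelated work here while the I/O runs ...

# Block only at the point where the data is required.
chunk = future.result()
```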

Exciting applications of TensorStore include PaLM and other large, sophisticated language models. With hundreds of billions of parameters, these neural networks push the limits of computing infrastructure while demonstrating surprising proficiency at generating and processing natural language. One difficulty during training is efficiently reading and writing the model parameters: although training is distributed across many machines, the parameters must be saved regularly to a single checkpoint on a long-term storage system without slowing down the training process. TensorStore has been used to address these challenges; it has been coupled with frameworks such as T5X and Pathways and used to manage checkpoints for large-scale (“multipod”) models trained with JAX.

Brain mapping is another intriguing use case. Synapse-resolution connectomics aims to trace the wiring of individual synapses in animal and human brains. This requires imaging the brain at extremely high resolution over fields of view spanning millimeters or more, producing petabyte-sized datasets that pose significant storage, manipulation, and processing challenges. With Google Cloud Storage serving as the underlying object storage system, TensorStore has been used to solve the computational challenges posed by some of the largest and most widely used connectomics datasets.
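
As an illustrative sketch (the bucket and dataset layout below are placeholders, not an actual released dataset), a small sub-volume of a petabyte-scale volume can be fetched like this:

```python
import tensorstore as ts

# Open a volume in the 'neuroglancer_precomputed' format commonly used
# for connectomics data; bucket and path are placeholders.
volume = ts.open({
    'driver': 'neuroglancer_precomputed',
    'kvstore': {'driver': 'gcs', 'bucket': 'example-connectomics-bucket',
                'path': 'segmentation/'},
}).result()

# Fetch only a small 3D region; the remaining petabytes stay in
# cloud storage and are never downloaded.
sub = volume[10000:10256, 12000:12256, 4000:4064].read().result()
print(sub.shape)
```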

To get started, Google Research has provided the TensorStore package, which can be installed with a single command. They have also released several API tutorials and documentation for reference.
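
The package is available on PyPI and can be installed with pip install tensorstore; a quick sanity check after installing:

```python
import numpy as np
import tensorstore as ts

# Wrap a small in-memory NumPy array as a TensorStore and slice it,
# just to confirm the installation works.
arr = ts.array(np.arange(16, dtype=np.uint32).reshape(4, 4))
print(arr[1:3, 1:3].read().result())
```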

Github: https://github.com/google/tensorstore

Reference article: https://ai.googleblog.com/2022/09/tensorstore-for-high-performance.html



