TensorFlow is a great library for tensor operations, covering most of the functions necessary for effective tensor manipulation. The code is extremely well optimized and it automatically scales to take advantage of the available hardware.
For these reasons it’s an appealing choice for writing parallel programs, but (as a software engineer) I would need more information on the computational cost of each operation in order to use it effectively in my programs.
Let me give an example. I’m currently enrolled in a master’s degree in artificial intelligence, and a few weeks ago we were given a reinforcement learning problem where we had to write an agent able to play a certain board game. To gather all the data for training our agent we had to play a lot of games, but the simulation engine provided was too slow. For this reason we decided to completely rewrite the simulation, parallelizing it in TensorFlow, and we were able to run hundreds of thousands of simulations in parallel in roughly the same wall-clock time. Writing all the optimized code was not easy, but it could be done.
While doing this project we realized that some of the operations were much slower than others. For example, all the segment operations (segment_sum, segment_max, … ) were much slower than creating a ragged tensor and applying a reduce_sum over the ragged dimension, even though the two approaches do approximately the same thing.
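To make the comparison concrete, here is a minimal sketch of the two approaches (the shapes, values and segment ids are invented purely for illustration):

```python
import tensorflow as tf

values = tf.random.uniform([6])                 # flat data
segment_ids = tf.constant([0, 0, 1, 1, 1, 2])   # which segment each value belongs to

# Approach 1: the segment operation
sums_a = tf.math.segment_sum(values, segment_ids)

# Approach 2: build a ragged tensor and reduce over the ragged dimension
ragged = tf.RaggedTensor.from_value_rowids(values, segment_ids)
sums_b = tf.reduce_sum(ragged, axis=1)

# Both produce one sum per segment; in our experiments the ragged version
# was noticeably faster, but the documentation gives no cost model to explain why.
```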
In another case, while implementing a sparsely connected neural network, we could not figure out whether it was faster to compute the product with a sparse tensor, to use gather to extract the slices we needed and apply the multiplications manually to the relevant neurons, or to use some other approach.
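For context, here is a rough sketch of the two alternatives we were weighing (toy sizes and a made-up connectivity pattern, just to show the shape of the question):

```python
import tensorflow as tf

x = tf.random.uniform([4])  # activations of the previous layer

# Option 1: store the sparse weight matrix explicitly and multiply.
weights = tf.sparse.SparseTensor(
    indices=[[0, 0], [0, 2], [1, 1], [1, 3]],   # (output neuron, input neuron)
    values=[0.5, -1.0, 2.0, 0.3],
    dense_shape=[2, 4])
y_sparse = tf.sparse.sparse_dense_matmul(weights, tf.reshape(x, [4, 1]))

# Option 2: gather the inputs each output neuron needs and reduce manually.
in_idx = tf.constant([0, 2, 1, 3])              # input index of each connection
out_idx = tf.constant([0, 0, 1, 1])             # output neuron each product feeds
w = tf.constant([0.5, -1.0, 2.0, 0.3])
y_gather = tf.math.segment_sum(tf.gather(x, in_idx) * w, out_idx)
```

Both compute the same sparsely connected layer, but without any documented cost model it is hard to predict which will be faster (or lighter on memory) on a given device.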
In my opinion, complete documentation for a library, especially one that offers such low-level operations, should not omit the computational cost of the basic operations.
I’m aware that running the code on different architectures will result in completely different behaviours (running on a CPU is different from running on a GPU or TPU), but I still think a complete description of the behaviour should be available for advanced users. The two main points of interest are, of course, the time complexity and the space complexity (because if the GPU crashes with an out-of-memory error, fast code is of no use) of each basic TensorFlow operation, highlighting the differences among the architectures.
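In the absence of such documentation, the only option today seems to be benchmarking candidate implementations by hand on every target device; a crude sketch of what that looks like (sizes are arbitrary, and this measures wall-clock time only, not memory):

```python
import timeit
import tensorflow as tf

values = tf.random.uniform([1_000_000])
ids = tf.sort(tf.random.uniform([1_000_000], maxval=1000, dtype=tf.int32))

segment = tf.function(lambda: tf.math.segment_sum(values, ids))
ragged = tf.function(
    lambda: tf.reduce_sum(tf.RaggedTensor.from_value_rowids(values, ids), axis=1))

segment(); ragged()  # warm-up so tracing is not included in the timing
print("segment_sum:", timeit.timeit(segment, number=100))
print("ragged sum: ", timeit.timeit(ragged, number=100))
```

This has to be repeated for every operation, input size and device of interest, which is exactly the effort a documented cost model would save.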
Maybe such a resource is already available somewhere, but in all my searching I wasn’t able to find it. If it isn’t, I think it would be a very useful addition, even if only for a few specialized developers and researchers.