Actually, let’s run a longer experiment and track changes since the first commit of this week:
...$ bazel clean --expunge
...$ rm -rf ~/bazel_cache/
...$ for commit in $(git log --first-parent --pretty=oneline ...623ed300b593b368e665899be3cf080c5a0e3ebe | tac | cut -d' ' -f1)
> do
> git checkout ${commit} &> /dev/null
> echo "Building at ${commit}"
> (time bazel build --disk_cache=~/bazel_cache //tensorflow/tools/pip_package:build_pip_package >/dev/null) 2>&1 | grep real
> done | tee ~/build_log
That is 273 commits. At the end of the job, the build cache takes up 124 GB (something most people in OSS cannot afford either):
...$ du -sh ~/bazel_cache/
124G /usr/local/google/home/mihaimaruseac/bazel_cache/
Anyway, let’s look at the timing info from the log:
# join each "Building at <commit>" line with its "real XmYs" line, then convert XmYs into (60*X + Y) seconds
...$ awk '!(NR%2) { print p, $2} { p = $3 }' ~/build_log | sed -e 's/m/ /' -e 's/s$//' | awk '{ print $1, ($2 * 60 + $3) }' > ~/times_in_seconds
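To make the conversion easier to follow, here is a minimal sketch that wraps the same awk/sed pipeline in a function and feeds it a made-up two-commit log. The function name `to_seconds` and the short hashes `aaaa111` and `bbbb222` are hypothetical; only the pipeline itself comes from the command above.

```shell
# to_seconds: read a log of alternating "Building at <commit>" and
# "real XmYs" lines on stdin, print "<commit> <seconds>" per build.
to_seconds() {
  # On even lines, print the commit remembered from the previous line plus the time.
  awk '!(NR%2) { print p, $2 } { p = $3 }' |
    # Split "XmYs" into "X Y" (hex hashes never contain an "m").
    sed -e 's/m/ /' -e 's/s$//' |
    awk '{ print $1, ($2 * 60 + $3) }'
}

# Hypothetical log with two builds, in the shape the loop writes.
printf '%s\n' \
  'Building at aaaa111' \
  'real 1m12.500s' \
  'Building at bbbb222' \
  'real 0m33.250s' | to_seconds
# prints:
# aaaa111 72.5
# bbbb222 33.25
```

With the real log this would be `to_seconds < ~/build_log > ~/times_in_seconds`.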
# get a histogram with 10-second bins (each = stands for two builds)
...$ sort -rnk2 ~/times_in_seconds | cut -d. -f1 | cut -d' ' -f2 | sed -e 's/.$//' | uniq -c | perl -lane 'print $F[1], "0-", $F[1], "9\t", "=" x ($F[0] / 2)'
1180-1189
580-589
570-579
550-559 =
520-529 =
430-439
360-369
280-289
270-279
240-249
230-239
210-219 =
170-179 =
160-169 =
140-149
120-129 ==
110-119 =
90-99
80-89 =
70-79 ===
60-69 ==
50-59 ===
40-49 ==========
30-39 =====================================================================================================
# also print the counts (count, then the bin in tens of seconds) instead of just ====s
...$ sort -rnk2 ~/times_in_seconds | cut -d. -f1 | cut -d' ' -f2 | sed -e 's/.$//' | uniq -c
1 118
1 58
1 57
2 55
2 52
1 43
1 36
1 28
1 27
1 24
1 23
2 21
2 17
3 16
1 14
4 12
3 11
1 9
3 8
7 7
5 6
7 5
20 4
202 3
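The same binning can also be done in a single awk pass, which may be easier to read than the cut/sed/uniq chain. This is only a sketch of an alternative, not the commands used above: the function name `bin_times` is hypothetical, and unlike the original pipeline it also puts times under 10 seconds into a 0-9 bin.

```shell
# bin_times: read "<commit> <seconds>" lines on stdin and print
# "<bin start>-<bin end> <count>", slowest bin first.
bin_times() {
  awk '{ bin = int($2 / 10) * 10; count[bin]++ }
       END { for (b in count) printf "%d-%d %d\n", b, b + 9, count[b] }' |
    sort -rn
}

# Hypothetical input: two builds in the 30-39 bin and one slow outlier.
printf '%s\n' 'aaaa111 33.2' 'bbbb222 38.9' 'cccc333 1184.46' | bin_times
# prints:
# 1180-1189 1
# 30-39 2
```

With the real data this would be `bin_times < ~/times_in_seconds`.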
As you can see, most incremental builds take 30-40 seconds (202 out of 273!), but some take much longer. Let’s look into those:
# longest 20 times
...$ sort -rnk2 ~/times_in_seconds | head -n20
699c63cf6b0136a330ae8c5f56a2087361f6701e 1184.46
b836deac35cd58c271aebbebdc6b0bd13a058585 585.35
abcced051cb1bd8fb05046ac3b6023a7ebcc4578 574.188
b49d731332e5d9929acc9bfc9aed88ace61b6d18 556.711
42f72014a24e218a836a87452885359919866b0b 553.296
982608d75d1493b4e351ef84d58bc0fdf78203c8 527.231
a868b0d057b34dbd487a1e3d2b08d5489651b3ff 523.162
c432f62159879d83e62d72afc9ef80cb6cdbe1e5 433.18
8b05b58c7c9cb8d1ed838a3157ddda8694c028f4 366.548
36931bae2a36efda71f96c9e879e91b087874e89 280.591
b71106370c45bd584ffbdde02be21d35b882d9ee 272.807
86fb36271f9068f84ddcecae74fe0b7df9ce83ee 242.273
1848375d184177741de4dfa4b65e497b868283cd 239.788
9770c84ea45587524e16de233d3cf8b258a9bd77 219.21
61bcb9df099b3be7dfbbbba051ca007032bfb777 214.006
d3a17786019d534fb7a112dcda5583b8fd6e7a62 172.092
e8dc63704c88007ee4713076605c90188d66f3d2 170.582
ddcc48f003e6fe233a6d63d3d3f5fde9f17404f1 169.959
2035c4acc478b475c149f9be4f2209531d3d2d0d 169.84
3edbbc918a940162fc9ae4d69bba0fff86db9ca2 167.948
# which commit is behind each of these times
...$ for commit in $(sort -rnk2 ~/times_in_seconds | head -n20 | awk '{ print $1 }'); do git log -n1 --pretty=oneline ${commit}; done
699c63cf6b0136a330ae8c5f56a2087361f6701e use tensorflow==2.5.0 to temporarily solve the failure of `evaluate_tflite` function.
b836deac35cd58c271aebbebdc6b0bd13a058585 Remove TensorShape dependency from ScopedMemoryDebugAnnotation.
abcced051cb1bd8fb05046ac3b6023a7ebcc4578 Prevent crashes when loading tensor slices with unsupported types.
b49d731332e5d9929acc9bfc9aed88ace61b6d18 Integrate LLVM at llvm/llvm-project@955b91c19c00
42f72014a24e218a836a87452885359919866b0b Remove experimental flag `fetch_remote_devices_in_multi_client`.
982608d75d1493b4e351ef84d58bc0fdf78203c8 Switched to OSS llvm build rules instead of scripts imported from third_party.
a868b0d057b34dbd487a1e3d2b08d5489651b3ff Integrate LLVM at llvm/llvm-project@fe611b1da84b
c432f62159879d83e62d72afc9ef80cb6cdbe1e5 Integrate LLVM at llvm/llvm-project@b52171629f56
8b05b58c7c9cb8d1ed838a3157ddda8694c028f4 Integrate LLVM at llvm/llvm-project@8c3886b0ec98
36931bae2a36efda71f96c9e879e91b087874e89 Integrate LLVM at llvm/llvm-project@4b4bc1ea16de
b71106370c45bd584ffbdde02be21d35b882d9ee Integrate LLVM at llvm/llvm-project@bd7ece4e063e
86fb36271f9068f84ddcecae74fe0b7df9ce83ee Integrate LLVM at llvm/llvm-project@fda176892e64
1848375d184177741de4dfa4b65e497b868283cd Merge pull request #51511 from PragmaTwice:patch-1
9770c84ea45587524e16de233d3cf8b258a9bd77 Integrate LLVM at llvm/llvm-project@cc4bfd7f59d5
61bcb9df099b3be7dfbbbba051ca007032bfb777 Integrate LLVM at llvm/llvm-project@8e284be04f2c
d3a17786019d534fb7a112dcda5583b8fd6e7a62 Fix and resubmit subgroup change
e8dc63704c88007ee4713076605c90188d66f3d2 Add BuildTensorSlice for building from unvalidated TensorSliceProtos.
ddcc48f003e6fe233a6d63d3d3f5fde9f17404f1 [XLA:SPMD] Improve partial manual sharding handling. - Main change: make sharding propagation work natively with manual subgroup sharding. There were some problems when propagating with tuple shapes. This also avoids many copies, which is important for performance since the pass runs multiple times. - Normalize HloSharding::Subgroup() to merge the same type of subgroup dims. - Handle tuple-shaped ops (e.g., argmax as reduce, sort) in SPMD partitioner. - Make SPMD partitioner to handle pass-through ops (e.g., tuple) natively, since they can mix partial and non-partial elements in a tuple.
2035c4acc478b475c149f9be4f2209531d3d2d0d Legalizes GatherOp via canonicalization to GatherV2Op; i.e. Providing default values of 0 for the axis parameter and the batch_dims attribute.
3edbbc918a940162fc9ae4d69bba0fff86db9ca2 Internal change
10 of these 20 commits are LLVM-related: 9 hash bumps ("Integrate LLVM at ...") plus the switch to the OSS llvm build rules. In total, 11 of the 273 commits considered mention LLVM:
...$ for commit in $(cat ~/times_in_seconds | awk '{ print $1 }'); do git log -n1 --pretty=oneline ${commit}; done | grep LLVM | wc -l
11
So almost all LLVM commits result in long compile times, and half of the 20 longest builds are LLVM-related commits.
I’d say this is quite costly, and we need a plan for handling it in a way that helps OSS users.
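Edit: Actually, nearly all LLVM hash bumps are among the longest compiles. The 11 matches are 10 hash bumps plus the conversion to upstream build rules (e624ad90), and only one hash bump (3487b91d) is missing from the top 20 listed above: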
...$ for commit in $(cat ~/times_in_seconds | awk '{ print $1 }'); do git log -n1 --pretty=oneline ${commit}; done | grep LLVM
b49d731332e5d9929acc9bfc9aed88ace61b6d18 Integrate LLVM at llvm/llvm-project@955b91c19c00
3487b91d529f2cbc412121d60845cda014e0db7d Integrate LLVM at llvm/llvm-project@9cdd4ea06f09
c432f62159879d83e62d72afc9ef80cb6cdbe1e5 Integrate LLVM at llvm/llvm-project@b52171629f56
86fb36271f9068f84ddcecae74fe0b7df9ce83ee Integrate LLVM at llvm/llvm-project@fda176892e64
36931bae2a36efda71f96c9e879e91b087874e89 Integrate LLVM at llvm/llvm-project@4b4bc1ea16de
e624ad903f9c796a98bd309268ccfca5e7a9c19a Use upstream LLVM Bazel build rules
8b05b58c7c9cb8d1ed838a3157ddda8694c028f4 Integrate LLVM at llvm/llvm-project@8c3886b0ec98
9770c84ea45587524e16de233d3cf8b258a9bd77 Integrate LLVM at llvm/llvm-project@cc4bfd7f59d5
b71106370c45bd584ffbdde02be21d35b882d9ee Integrate LLVM at llvm/llvm-project@bd7ece4e063e
a868b0d057b34dbd487a1e3d2b08d5489651b3ff Integrate LLVM at llvm/llvm-project@fe611b1da84b
61bcb9df099b3be7dfbbbba051ca007032bfb777 Integrate LLVM at llvm/llvm-project@8e284be04f2c
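Checking two lists against each other by eye is error-prone, so here is a minimal sketch of doing it mechanically. It assumes a second file holding `git log --pretty=oneline` output (hash, then subject) for the same commits. That file, the function name `llvm_in_slowest`, and the sample hashes are all hypothetical. The match is case-insensitive, so a subject that spells it "llvm" is caught too.

```shell
# llvm_in_slowest TIMES SUBJECTS [N]
# TIMES:    "<commit> <seconds>" lines (like ~/times_in_seconds)
# SUBJECTS: "<commit> <subject>" lines (oneline git log output)
# Prints the SUBJECTS lines that mention LLVM and are among the N slowest builds.
llvm_in_slowest() {
  sort -rnk2 "$1" | head -n "${3:-20}" |
    awk 'NR == FNR { slow[$1] = 1; next }
         ($1 in slow) && tolower($0) ~ /llvm/' - "$2"
}

# Hypothetical data: four builds, of which the two slowest are LLVM-related.
times=$(mktemp); subjects=$(mktemp)
printf '%s\n' 'aaa 500' 'bbb 35' 'ccc 300' 'ddd 33' > "$times"
printf '%s\n' 'aaa Integrate LLVM at x' 'bbb Integrate LLVM at y' \
  'ccc Switched to OSS llvm build rules' 'ddd Internal change' > "$subjects"
llvm_in_slowest "$times" "$subjects" 2
# prints:
# aaa Integrate LLVM at x
# ccc Switched to OSS llvm build rules
rm -f "$times" "$subjects"
```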