I agree. We moved to Bazel at my last job and it took about 6 person-months. Most of that person was me, and the estimate includes other tooling not related to Bazel. Some extremely competent engineers also moved over our frontend code (including stuff like Cypress) and Python code (which needed to run against 3 different versions of Python). They had no Bazel experience beforehand, asked me maybe a couple hours' worth of questions, and just got it done. So I don't think you need to be a Bazel genius to get this done, but it helps to have someone with a Vision, which was me in this case. All in all, I'd do it again. I'm in the process of moving all my open source code to a monorepo (jrockway/monorepo, which really should have been called jrockway/jrockway) because the development experience is so much better.
The biggest reason for doing the project at work was that new employees couldn't run the code they were working on. We hired people. They tried. They weren't very productive. That's my fault, and I wanted to fix it while supporting our policy of "you can use any Linux distribution you want, or you can use an arm64 Mac". Many people suggested things like "force everyone to use NixOS", which I would have been in favor of, but it wasn't the solution that won. (I honestly prefer Debian myself and didn't think that my preference should dictate how the team works. The fact that I disagreed with the proposed solution is a good indicator that people would be unhappy with anything I declared by fiat.) Rather, using Bazel as a framework for retrieving third-party tooling and also building our code was a comfortable compromise.
A secondary goal was test caching. If you edit README.md, CI doesn't need to rerun the Cypress tests. (As a corollary, if you edit "metadata_server.go", the "pfs_server.go" tests don't need to run, as the PFS server does not depend on the Metadata server.)
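To make that concrete, here's a minimal sketch of the kind of BUILD wiring that makes this work; the target and file names are hypothetical, not our actual layout. A test's cache key is derived only from its declared srcs and deps, so edits elsewhere leave it untouched.

```
# Hypothetical BUILD file for the PFS server (names are illustrative).
load("@io_bazel_rules_go//go:def.bzl", "go_library", "go_test")

go_library(
    name = "pfs_server",
    srcs = ["pfs_server.go"],
    importpath = "example.com/acme/pfs",
    deps = ["//src/internal/storage"],  # note: no dep on the metadata server
)

go_test(
    name = "pfs_server_test",
    srcs = ["pfs_server_test.go"],
    embed = [":pfs_server"],
)
```

Because nothing in this target's transitive inputs changes when you edit README.md or metadata_server.go, Bazel (and the remote cache) just replays the previous test result.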
The biggest source of slowness and complexity in the workflow was building our code into a container to run in k8s. We used goreleaser, and that involved building the code twice, once for each architecture, to produce an OCI image index, which was our main release artifact. The usual shell scripts for local development just reused this, and it was terribly slow. Throw in Docker to do the builds, which deletes the Go build cache after every build, and you have a recipe for not getting anything done.

Bazel is a much better way to build containers. Containers are just some JSON files and tar files. Bazel (rules_oci) just assembles your build artifacts into the necessary JSON files and tar files. To build a multi-architecture image index, you build twice and add a JSON file. Bazel handles this with platform transitions; you make a rule to build for the host architecture (technically transitioned to Linux on a macOS host), and then the image index rule builds that target with two configurations (cross-compiled, not emulated with binfmt_misc like "docker buildx") and assembles the two artifacts into the desired multi-arch image. When running locally, you skip 2 of the 3 steps and just build for the host machine.

Combined with proper build caching (thanks BuildBuddy!), this means that making an image to run in k8s takes about 10 seconds instead of 6 minutes. With the previous system you could try your code 80 times a day. With the new system, you could try it 2880 times ;) This increased productivity.
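For flavor, here's roughly what that shape looks like with rules_oci. This is a hedged sketch with made-up target names, and the platform-transition wiring that produces the two per-architecture images (which varies between rules_oci versions) is elided.

```
load("@io_bazel_rules_go//go:def.bzl", "go_binary")
load("@rules_pkg//pkg:tar.bzl", "pkg_tar")
load("@rules_oci//oci:defs.bzl", "oci_image", "oci_image_index")

go_binary(
    name = "server",
    srcs = ["main.go"],
)

# The "layer" is literally just a tar file containing the binary.
pkg_tar(
    name = "server_layer",
    srcs = [":server"],
)

# One image: base layers plus our tar plus some config JSON.
oci_image(
    name = "image",
    base = "@distroless_base",
    entrypoint = ["/server"],
    tars = [":server_layer"],
)

# The release artifact: the same image built under two platform
# configurations (via a transition, not shown), plus the manifest-list
# JSON that ties them together.
oci_image_index(
    name = "image_index",
    images = [
        ":image_linux_amd64",
        ":image_linux_arm64",
    ],
)
```

For local development you stop at ":image" for the host platform; only release builds pay for the second architecture and the index.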
I also wrote a bunch of tools to make setting up k8s easier, which would have been perfectly possible without Bazel, but it helped. (Before, everyone pushed their built image to DockerHub and then reconfigured k8s to pull that. Now we have a local registry, and if two people do a build at the same time, you always get yours and not theirs. I did not design this previous system; I merely set out to fix it because it's Wrong.)

Bazel makes vendoring tools pretty easy. For our product we needed things like kubectl, kind, skopeo, postgres, etc. These are all in //tools/whatever and can be run for your host machine with `bazel run //tools/whatever`. So once you ran my program to create and update your environment, you automatically had the right version of the tool to interact with it. We upgraded k8s regularly and nobody noticed. They would just get the new tool the next time they tried to run it. (A centrally managed Linux distribution would do the same thing, but it couldn't revert you to an old tool when you checked out an old version to debug. A README with versions would work, but I learned that nobody really reads the READMEs until you ask them to. "How do I do X" "See this section of the README" "Oh damn I wish I thought of that" "Me too." ;)
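As a sketch of the vendored-tool idea (not the author's actual five-liner; the repository names are hypothetical http_file downloads declared elsewhere), bazel-skylib's native_binary is one way to make a prebuilt download runnable:

```
# //tools/kubectl/BUILD — rough sketch; @kubectl_linux_amd64 etc. are
# hypothetical http_file repositories pinned to a specific version + sha256.
load("@bazel_skylib//rules:native_binary.bzl", "native_binary")

native_binary(
    name = "kubectl",
    src = select({
        # Abbreviated: a real setup would also select on CPU.
        "@platforms//os:linux": "@kubectl_linux_amd64//file",
        "@platforms//os:macos": "@kubectl_darwin_arm64//file",
    }),
    out = "kubectl",
)
```

`bazel run //tools/kubectl -- get pods` then always uses the version pinned in the tree you have checked out, which is what makes the "check out an old version, get the old tool" behavior fall out for free.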
The biggest problem I had with Bazel in the past was dealing with generated files: editor support, "go get <our thing>", etc. I got by when I used Blaze at Google, but realistically, there was no editor support for Go at that time, so I didn't notice how badly it worked. There is now GOPACKAGESDRIVER, which technically helps, but it didn't work well for me and I wasn't going to inflict it on my team. I punted this time and continued to check in generated files. We have a target //:make_proto that regenerates the code from the proto files, and a test that checks that you did it. You check in the regenerated files when you change the protos. It works, and I have a general rule for all generated files like this. (We also generate a bunch of JSON from Jsonnet files; this mechanism helps with that.)
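For reference, a minimal sketch of that "generated file is up to date" check, using bazel-skylib's diff_test; the target names are made up and this may not be exactly how the author wired it:

```
# Fails if the checked-in foo.pb.go drifts from what the proto rules generate.
load("@bazel_skylib//rules:diff_test.bzl", "diff_test")

diff_test(
    name = "foo_pb_go_up_to_date_test",
    file1 = ":foo_go_proto_src",  # hypothetical: the freshly generated foo.pb.go
    file2 = "foo.pb.go",          # the copy committed to the repo
)
```

//:make_proto is then just a runnable target that copies the generated output back into the source tree, and the test keeps everyone honest.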
All in all, you can get a fresh machine, install git and Bazelisk, check out our repo, and get "bazel test ..." to succeed. That, to me, is the minimum dev experience that your employer owes you, and if you joined my team, you'd get it. That made me happy and it wouldn't be as good without Bazel. I'd do it again!
Just as an aside, after the Bazel conversion I did a really complicated change, and Bazel didn't make it any harder. We made our main product depend on pg_dump, and adjusting the container-building rules from "distroless plus a Go binary" to "Debian plus postgres plus a Go binary" was pretty easy. rules_debian is very nifty, and it gives me a sense of reproducibility that I never got from "FROM debian:stable@sha256...; RUN apt-get update && apt-get install ....". Indeed, the reproducibility is there: you can take any release tag, run "bazel build //oci:whatever", and see that the resulting multi-arch image has the same sha256 as what's on DockerHub. I couldn't have done that without Bazel, or at least not without writing a lot of code.
I don't work there anymore but I'm really happy about the project. I don't even do Build & Release stuff. I just add features to the product. But this needed to be done and I was happy to wear the hat.
As someone who is somewhat experienced with build systems in general (though not with Bazel) and has had to solve a lot of the issues you mentioned in different ways (i.e. without Bazel), I have been interested in learning Bazel for a long time, as its core principles seem very sound to me. However, the few times I looked into it I found it rather impenetrable. In particular, defining build steps "declaratively" in Starlark just seemed to me to be a slightly less bad way of writing magic incantations in YAML. In other words, you still had to understand what exactly every magic incantation did under the hood and how to configure it, and the documentation generally didn't seem great.
Is there some resource (blog/book/…) you can recommend for learning Bazel?
I feel like I got the basics from using Blaze for years at Google. Things like "oh yeah, buildifier will autoformat my BUILD files" and the basic flow of how a build system is supposed to work.
Figuring out how to complete a large project with Bazel involved a few skills that one should be ready to employ.
1) Programming. The stuff out there often can't do things exactly the way you want. I wanted to use a bunch of golangci-lint checks with "nogo", so I opened up the golangci-lint source code and copy-pasted their code into my project to adapt the checks to how nogo works (there's a rough sketch of the nogo wiring after this list). People have tried to fix this problem generically before, but their solutions didn't pan out, and there are just a bunch of half-abandoned git repositories floating around that don't work. Write it yourself. (I had to write a lot of code for this project: compiling protos the way we want, producing reproducible tar files with more complex edits than I wanted to do with mtree -> awk -> bsd tar, installing built binaries, building "integration test" Go coverage binaries, etc. Lots of code.)
2) Debugging. A lot happens behind the scenes and you always need to be situationally aware of what's being done for you. For example, I was pretty sure our containers would be "reproducible" i.e. have the same sha256 no matter the configuration of the build machine. That was ... not true. I tested it and it wasn't happening. So I had to dive into the depths of the outputs and see which bytes were "wrong" in which place, and then debug the code involved to fix the problem. (It was a success, and oddly I sent the PR to fix it about 5 seconds before someone else sent the exact same PR.)
3) Depth. There probably isn't a way to be functional where you pick something out of your search results, follow the quickstart, and then happily enjoy the results. Rather you should expect to read all of the documentation, then read most of the code, then check out the code and add a bunch of print statements... with each level of this involving some recursion to the same step for a sub-dependency. For example, I never really knew how "go build" worked, but needed to learn when I suspected linking time was too high. (Is it the same for 'go build'? Yes. Why? It's spending all of its time in 'gold'. What's gold, the go linker? No, it's the random thing Debian installed with gcc. Is there an alternative? Yes, lld and mold. Are those faster? Yes. How do I use one of those with Bazel? I'll add some print statements to rules_go and use that copy instead of the upstream one.)
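Since point 1 mentions nogo: the adapted golangci-lint analyzers are ordinary Go code you write yourself, and the Bazel side of the wiring is comparatively small. A rough, hypothetical sketch (label names are illustrative):

```
load("@io_bazel_rules_go//go:def.bzl", "nogo")

nogo(
    name = "nogo",
    deps = [
        # Stock analyzers from golang.org/x/tools...
        "@org_golang_x_tools//go/analysis/passes/nilness",
        # ...plus the analyzers copied and adapted from golangci-lint.
        "//lint/bodyclose",
        "//lint/errcheck",
    ],
    config = "nogo_config.json",
    visibility = ["//visibility:public"],
)
```

You then point rules_go at this target (go_register_toolchains(nogo = "@//:nogo") in WORKSPACE setups, or the equivalent bzlmod hook) and the analyzers run as part of every compile.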
With all that in mind, I never figured out "everything". There is a lot of stuff I took at face value, like configuration transitions for multi-arch builds. The build happens 3 times but we only build for 2 platforms (the third platform is the host machine); I don't know why, or how to prevent the host build. (I did figure out how to prevent the extra build for some platform-independent outputs, though, like generating static content with Hugo.) I also wired up a bunch of vendored tools but never used Bazel's actual toolchain machinery. I had my works-with-5-lines-of-code way of running vendored tools for the host machine and never saw the need to type in 50 lines of boilerplate to do things the "right" way. I'm sure this will burn someone someday.
In the end, I guess motivation was the key. People on my team couldn't get their work done, and CI was so slow that people spent half their day in the "I'm going to go read Reddit until CI is done" cycle. Hacks had been attempted in the past, with a lot of effort put into them, and they still didn't work. So we had to rebuild the Universe from first principles, doing things the "right" way. The results were good.
I will always prefer this approach to the simpler ones. For one thing, Bazel always gives the "right answer" when it's set up correctly. It doesn't rely on developers being experts at managing their dev machines; you include all the tools they need, you can update them whenever you want a new feature, and they get it for free. That's the big selling point for me. I also can't deal with stuff that is obviously unnecessary, like how Dockerfile-based container builds require an ARM64 emulator just to run "mkdir" in a Dockerfile. You're just generating a stack of tar files and some JSON. Let me just tell you where the tar files and the JSON are. We do not need a virtual machine here.