I have been following recent progress in the formalization of mathematical proofs in Lean, particularly in the context of large language models. One prominent advocate of this approach is Terence Tao, who regularly writes about developments in this area.
From a programmer's perspective, this raises an interesting parallel. Models such as Sonnet or Opus 4.5 can generate thousands of lines of code per hour. I can review the output, ask the model to write tests, iterate on the result, and after several cycles become confident that the software is sufficiently correct.
For centuries, mathematicians developed proofs by hand, using pen and paper, and were able to check the proofs of their peers. In the context of LLMs, however, a new problem may arise. Consider an LLM that constructs a proof in Lean 4 iteratively over several weeks, resulting in more than 1,000,000 lines of Lean 4 code and concluding with a QED. At what point is a mathematician no longer able to confirm with confidence that the proof is correct?
Such a mathematician might rely on another LLM to review the proof, and that system might also report that it is correct. We may reach a stage where humans can no longer feasibly verify every proof produced by LLMs due to their length and complexity. Instead, we rely on the Lean compiler, which confirms formal correctness, and we are effectively required to trust the toolchain rather than our own direct understanding.
> more than 1,000,000 lines of Lean 4 code and concluding with a QED.
Usually the point of the proof is not to figure out whether a particular statement is true (which may be of little interest by itself, see the Collatz conjecture), but to develop some good ideas _while_ proving that statement. So there's not much value in a verified million lines of Lean by itself.
You'd want to study the (Lean) proof hoping to find some kind of new math invented in it or a particular trick worth noticing.
An LLM may first develop a proof in natural language, then prove its correctness while autoformalizing it in Lean. Maybe it will be worth something in that case.
No, the point of proofs in mathematics IS to prove a particular statement is true, given certain axioms (accepted truths). Yes, there are numerous benefits beyond demonstrating something is undeniably true, given certain accepted truths, perhaps more “useful” than the proof itself, but math is a method of formal knowledge that doesn’t accept shortcuts.
A lot of mathematicians (myself included) would say that the point of proofs isn’t entirely to know whether or not a statement is true, but that it exists to promote human understanding. In fact, I’d argue that at some level, knowing whether or not a theorem is true can be less important than understanding an argument.
This is why having multiple different proofs is valuable to the math community—because different proofs offer different perspectives and ways of understanding.
I am not a mathematician either, but having read Tao's essay on intuition (https://terrytao.wordpress.com/career-advice/theres-more-to-...) I can easily see how he would profitably point an LLM in the right direction and suggest approaches that would culminate in a successful proof once the dots are connected.
You don't have to trust the full Lean toolchain, only the trusted proof kernel (which is small enough to be understood by a human) and any desugarings involved in converting the theorem statement from high-level Lean to whatever the proof kernel accepts (these should also be simple enough). The proof itself can be checked automatically.
Let's assume the kernel is correct and a mathematician trusts it; now what happens if it is the proof itself that he can no longer understand, because it has become too sophisticated? I think that this would indeed be a difference, and a crossing point in the history of mathematical proofs.
In principle, it's no different than failing to understand the details of any calculation. If the calculating process is executed correctly, you can still trust the outcome.
I'm not sure I understand what you're getting at. Your last paragraph suggests that you understand the point of formal specification languages and theorem provers (i.e. that the automated prover verifies the proof, so you only have to trust the toolchain), but in your next-to-last paragraph you speak as if human mathematicians need to verify the Lean 4 code of the proof. It doesn't matter how many lines the proof is; a proof can only be constructed in Lean if it's correct. (Well, assuming it's free of escape hatches like `sorry`.)
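For what it's worth, Lean makes that last check mechanical. A minimal Lean 4 sketch (the theorem and its name are arbitrary): `#print axioms` lists every axiom a proof ultimately depends on, so a proof that smuggled in a `sorry` would show `sorryAx` there.

```lean
-- Arbitrary example theorem, proved from the core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Audit the proof: this should report that `add_comm_example`
-- does not depend on any axioms; a hidden `sorry` would surface
-- as a dependency on `sorryAx`.
#print axioms add_comm_example
```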
> Well, assuming it's free of escape hatches like `sorry`
There are bugs in theorem provers, which means there might be "sorries", maybe even malicious ones (depending on what is at stake), that are not that easy to detect. Personally, I don't think that is much of a problem, as you should be able to come up with a "superlean" version of your theorem prover where correctness is easier to see, and then let the original prover export a proof that the superlean prover can check.
I think more of a concern is that mathematicians might no longer "understand" the proof that the machine generated. This concern is not about the proof being wrong despite having been checked, but about the proof being correct yet impossible for humans to "understand". I don't think that is too much of a concern either, as we can surely design the machine in a way that the generated proofs are modular, building up beautiful theories on their own.
A final concern might be that what gets lost is humans' understanding of what "understanding" means. I think that is the biggest concern, and I see it all the time when formalisation is discussed here on HN. Many here think that understanding is simply being able to follow the rules, and that the rules are an arbitrary game. That is simply not true. Obviously not: think about it, what does it even mean to "correctly follow the rules"?
I think the way to address this final concern (and maybe the other concerns as well) is to put beauty at the heart of our theorem provers. We need beautiful proofs, written in a beautiful language, checked and created by a beautiful machine.
> Personally, I don't think that is much of a problem, as you should be able to come up with a "superlean" version of your theorem prover where correctness is easier to see, and then let the original prover export a proof that the superlean prover can check.
I think this is sort of how Lean itself already works. It has a minimal trusted kernel that everything is forced through. Only the kernel has to be verified.
In principle, this is how these systems work. In practice, there are usually plenty of things that make it difficult to say for sure if you have a proof of something.
Understanding IMO is "developing a correct mental model of a concept".
Some heuristics of correctness:
Feynman: "What I cannot create, I do not understand"
Einstein: "If you can't explain it to a six year old, you don't understand it yourself"
Of course none of this changes anything about machine-generated proofs. The point of a proof is to communicate ideas; formalization and verification are simply a certificate showing that those ideas are worth checking out.
Ideas and correctness depend on each other. You usually start with an idea, and check if it is correct. If not, you adjust the idea until it becomes correct. Once you have a correct idea, you can go looking for more ideas based on this.
Formalisation and (formulating) ideas are not separate things, they are both mathematics. In particular, it is not that one should live in Lean, and the other one in blueprints.
Formalisation and verification are not simply certificates. For example, what language are you using for the formalisation? That influences how you can express your ideas formally. The more beautiful your language, the more the formal counterpart can look like the original informal idea. This capability might actually be a way to define what it means for a language to be beautiful, together with simplicity.
I share your fascination with proof assistants and formal verification, but the reality is that I have yet to see an actual mathematician working on frontier research who is excited about formalizing their ideas, or enthusiastic about putting in the (additional) work to build the formalization prerequisites needed even to state their theorems in that formal language.
You know what? I agree with you. I have not formalised any of my stuff on abstraction logic [1] for that reason (although that would not be too difficult in Isabelle or Lean); I want to write it down in Practal [2], and making that possible is what I see as the first serious milestone for Practal. Eventually, I want Practal to feel more natural than paper, and definitely more natural than LaTeX. That's the goal, and I feel many people now see that this will be possible with AI within the next decade.
>I am yet to see an actual mathematician working on frontier research who is excited about formalizing their ideas
British mathematician Kevin Buzzard has been evangelizing proof assistants since 2017. I'll leave it to you to decide whether he is working on frontier research:
Sure, he is one of the biggest advocates for it, and yet he has been quite clear that it is not yet possible for him to do his actual research in Lean.
Quoting one of the recent papers (2020):
> With current technology, it would take many person-decades to formalise Scholze’s results. Indeed, even stating Scholze’s theorems would be an achievement. Before that, one has of course to formalise the definition of a perfectoid space, and this is what we have done, using the Lean theorem prover.
One of my biggest points of criticism of Python is its slow cold start time. I especially notice this when I use it as a scripting language for CLIs. The startup time of a simple .py script can easily be in the 100 to 300 ms range, whereas a C, Rust, or Go program with the same functionality can start in under 10 ms. This becomes even more frustrating when piping several scripts together, because the accumulated startup latency adds up quickly.
Yes, that is also my feeling. But comparing an interpreted language with a compiled one is not really fair.
Here is my quick benchmark. I refrain from using Python for most scripting/prototyping task but really like Janet [0] - here is a comparison for printing the current time in Unix epoch:
```
$ hyperfine --shell=none --warmup 2 "python3 -c 'import time;print(time.time())'" "janet -e '(print (os/time))'"
Benchmark 1: python3 -c 'import time;print(time.time())'
  Time (mean ± σ):   22.3 ms ± 0.9 ms   [User: 12.1 ms, System: 4.2 ms]
  Range (min … max): 20.8 ms … 25.6 ms   126 runs

Benchmark 2: janet -e '(print (os/time))'
  Time (mean ± σ):   3.9 ms ± 0.2 ms   [User: 1.2 ms, System: 0.5 ms]
  Range (min … max): 3.6 ms … 5.1 ms   699 runs

Summary
  'janet -e '(print (os/time))'' ran
    5.75 ± 0.39 times faster than 'python3 -c 'import time;print(time.time())''
```
> The startup time of a simple .py script can easily be in the 100 to 300 ms range
I can't say I've ever experienced this. Are you sure it's not related to other things in the script?
I wrote a single-file Python script; it's a few thousand lines long. It can process a 10,000-line CSV file and do a lot of calculations, to the point where I wrote an entire CLI income / expense tracker with it [0].
The end-to-end time of the command is 100 ms to process those 10k lines, measured with `time`. That's on hardware from 2014 using Python 3.13 too. It takes ~550 ms to fully process 100k lines as well. I spent zero time optimizing the script but did try to avoid common pitfalls (deeply nested loops, etc.).
> I can't say I've ever experienced this. Are you sure it's not related to other things in the script? I wrote a single file Python script, it's a few thousand lines long.
It's because of module imports, primarily. It's worse with many small files than a few large ones (Python 3 adds a little additional overhead because of the extra system calls and complexity in the import process needed to handle `__pycache__` folders). A great way to demonstrate it is to ask pip to do something trivial (like `pip --version`, or `pip install` with no packages specified), or to compare the performance of pip installed in a venv to pip used cross-environment (with `--python`). Pip imports literally hundreds of modules at startup, and hundreds more the first time it hits the network.
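A rough way to see this per module on your own machine (the modules listed are arbitrary stdlib picks; `python -X importtime` gives a more detailed, built-in breakdown):

```python
# Sketch: time individual imports to see where startup cost goes.
# The module names are arbitrary examples, nothing special about them.
import importlib
import time

def timed_import(name):
    start = time.perf_counter()
    module = importlib.import_module(name)
    print(f"import {name}: {(time.perf_counter() - start) * 1000:.1f} ms")
    return module

for name in ("json", "argparse", "http.client"):
    timed_import(name)
```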
Yep, running `time` on my tool's --version takes 50 ms, and funny enough processing 10k CSV lines with ~2k lines of Python code takes 100 ms, so 50 ms of that is just Python preparing things to run by importing 20 or so standard library modules.
Exactly this. The time to start Python is roughly a function of timeof(stat) * numberof(stat calls), and on a networked filesystem that can often be orders of magnitude larger than on a local filesystem.
I do wonder, on a local filesystem, how much of the time is statting paths vs. reading the file contents vs. unmarshaling code objects. (Top-level code also runs when a module is imported, but the cost of that is of course highly module-dependent.)
Maybe you could take the stat timings, the read timings (both from strace) and somehow instrument Python to output timing for unmarshalling code (or just instrument everything in python).
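If it helps, the unmarshalling piece can be isolated in a few lines (the source path is an assumption; point it at any stdlib module with a cached .pyc on your system):

```python
# Sketch: time the unmarshalling of a cached .pyc, separately from the
# stat()/read() work. The path below is an assumption for illustration.
import importlib.util
import marshal
import time

pyc_path = importlib.util.cache_from_source("/usr/lib/python3.12/argparse.py")

with open(pyc_path, "rb") as f:
    f.read(16)          # skip the .pyc header (magic, flags, mtime/hash, size)
    payload = f.read()

start = time.perf_counter()
code = marshal.loads(payload)   # the code object the import machinery would execute
print(f"unmarshal argparse: {(time.perf_counter() - start) * 1000:.2f} ms")
```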
Either way, at least on my system with cached file attributes, Python can start up in 10 ms, so it's not clear whether you truly need to optimize much more than that (by identifying the remaining bits to optimize), versus solving the problem another way (not statting 500 files, most of which don't exist, every time you start up).
This benchmark is a little bit outdated but the problem remains the same.
Interpreter initialization: Python builds and initializes its entire virtual machine and built-in object structures at startup. Native programs already have their machine code ready and need very little runtime scaffolding.
Dynamic import system: Python’s module import machinery dynamically locates, loads, parses, compiles, and executes modules at runtime. A compiled binary has already linked its dependencies.
Heavy standard library usage: Many Python programs import large parts of the standard library or third-party packages at startup, each of which runs top-level initialization code.
This is especially noticeable if you do not run on an M1 Ultra, but on some slower hardware. From the results on a Raspberry Pi 3:
C: 2.19 ms
Go: 4.10 ms
Python3: 197.79 ms
That is about 200 ms of startup latency for a print("Hello World!") in Python 3.
Interesting. The tests use Python 3.6, which on my system replicates the huge difference shown in startup time using and not using `-S`. From 3.7 onwards, it makes a much smaller percentage change. There's also a noticeable difference the first time; I guess because of Linux caching various things. (That effect is much bigger with Rust executables, such as uv, in my testing.)
Anyway, your analysis of causes reads like something AI generated and pasted in. It's awkward in the context of the rest of your post, and 2 of the 3 points are clearly irrelevant to a "hello world" benchmark.
This is not an apples-to-apples comparison. Python needs to load and interpret the whole requests module when you run the above program. The golang linker does dead code elimination, so it probably doesn't run anything and doesn't actually do the import when you launch it.
Sure, it's not an apples-to-apples comparison; Python is interpreted and Go is statically compiled. But that doesn't change the fact that in practice a "simple" Python program/script can take longer to start up than Go takes to run your entire program.
Even if you actually use the network module in Go, just so that the compiler doesn't strip it away, you would still see startup latency well below 25 ms in Go, in my experience writing CLI tools.
Whereas with Python, even in the latest version, you're already looking at at least 10x that startup latency in practice.
Note: this excludes the actual time taken by the network call itself, which can of course add quite a few milliseconds, depending on how far away on planet Earth your destination is.
It's not interpreting; Python is loading the already byte-compiled version.
But it's also statting several files (various extensions).
I believe in the past people have looked at putting the standard library in a zip file instead of splatting it out into a bunch of files in a directory tree. In that case, I think Python would just do a few stats, find the zipfile, load the whole thing into RAM, and then index into the file.
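That mechanism already exists in the interpreter as zipimport. A minimal sketch (the module and archive names are made up):

```python
# Sketch: import a module from a zip archive via zipimport, replacing
# many per-file stat() calls with a single archive lookup.
# Assumes a hypothetical mylib.py sits in the current directory.
import sys
import zipfile

with zipfile.ZipFile("bundle.zip", "w") as zf:
    zf.write("mylib.py")

sys.path.insert(0, "bundle.zip")   # zip archives on sys.path are handled by zipimport
import mylib                       # resolved from inside bundle.zip
print(mylib.__file__)              # e.g. bundle.zip/mylib.py
```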
You should look at the self-executing .pex file format (https://docs.pex-tool.org/whatispex.html). The whole python program exists as a single file. You can also unzip the .pex and inspect the dependency tree.
It's tooling agnostic and there are a couple of ways to generate them, but the easiest is to just use Pants to build them.
Pants also does dependency traversal (that's the main reason we started using it, deploying a microservices monorepo) so it only packages the necessary modules.
I haven't profiled it yet for cold starts, maybe I'll test that real quick.
Edit: just ran it on a hello world with py3.14 on an M3 MacBook Pro; about 100 ± 30 ms for `python -m hello` and 300-400 ms (with wild variance) for executing the pex with `./hello/binary.pex`.
I'm not sure if a pants expert could eke out more speed gains and I'm also not sure if this strategy would win out with a lot of dependencies. I'm guessing the time required to stat every imported file pales in comparison to the actual load time, and with pex, everything needs to be unzipped first.
Pex is honestly best when you want to build and distribute an application as a single file (there are flags to bundle the python interpreter too).
Now if I use `python -S` (which disables `import site` on initialization), that gets down to ~15 ms execution time for hello world. But that gain gets killed as soon as you start trying to import certain modules (there is a very limited set of modules you can work with and still keep the speedup). So if your whole script is pure Python with no imports, you could probably have a 20 ms cold start.
Regarding cold starts, I strongly believe V8 snapshots are not the best way to achieve fast cold starts with Python (they may be if you are tied to using V8, though!), and they will have broad side effects if you go outside the standard packages included in the Pyodide bundle.
To put this in perspective: V8 snapshots store the whole state of an application (including its compiled modules). This means that for a Python package that is using Python (one wasm module) + Pydantic-core (one wasm module) + FastAPI... all of those will be included in one snapshot (as well as the application state). This makes sense for browsers, where you want to be able to inspect/recover everything at once.
The issue with this design is that the compiled artifacts and the application state are bundled into a single artifact (this is not great for AOT-designed runtimes, though it might be the optimal design for JITs).
Ideally, you would separate each of the compiled modules from the state of the application. When you do this, you have some advantages: you can deserialize the compiled modules in parallel, and untie the "deserialization" from recovering the state of the application. This design doesn't adapt that well into the V8 architecture (and how it compiles stuff) when JavaScript is the main driver of the execution, however it's ideal when you just use WebAssembly.
This is what we have done at Wasmer, which allows cold starts much faster than 1 second. Because we cache each of the compiled modules separately, and recover the state of the application later, we can achieve cold starts that are an order of magnitude faster than Cloudflare's state of the art (when using pydantic, fastapi and httpx).
If anyone is curious, here is a blogpost where we presented fast-cold starts for the application state (note that the deserialization technique for Wasm modules is applied automatically in Wasmer, and we don't showcase it on the blogpost): https://wasmer.io/posts/announcing-instaboot-instant-cold-st...
Note aside: congrats to the Cloudflare team on their work on Python on Workers, it's inspiring to all providers on the space... keep it up and let's keep challenging the status quo!
Big packages shouldn't be imported until the CLI arguments have been parsed and control is handed off to main. There's been work to do this automatically, but it's good hygiene to avoid top-level imports of big packages anyway.
A modern machine shouldn’t take this long, so likely something big is being imported unnecessarily at startup. If the big package itself is the issue, file it on their tracker.
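A minimal, runnable sketch of that structure (`mytool` and its subcommands are made up; `csv` and `statistics` stand in for genuinely heavy dependencies such as pandas or matplotlib):

```python
# Sketch: defer imports into the subcommand that needs them, so that
# fast paths (like a version subcommand) never pay for them.
import argparse

def cmd_stats(args):
    import csv          # deferred: only imported when this subcommand runs
    import statistics
    with open(args.path, newline="") as f:
        values = [float(row[0]) for row in csv.reader(f)]
    print(statistics.mean(values))

def cmd_version(args):
    print("mytool 1.0")  # fast path: nothing heavy is imported

def main():
    parser = argparse.ArgumentParser(prog="mytool")
    sub = parser.add_subparsers(dest="command", required=True)
    stats = sub.add_parser("stats")
    stats.add_argument("path")
    stats.set_defaults(func=cmd_stats)
    sub.add_parser("version").set_defaults(func=cmd_version)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```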
Are you comparing the startup time of an interpreted language with the startup time of a compiled language? or you mean that `time python hello.py` > `( time gcc -O2 -o hello hello.c ) && ( time ./hello )` ?
Here's the thing: I don't really care if it's because the interpreter has to start up, or there's a remote HTTP call, or we scan the disks for integrity; the end-user experience on every run is slower.
The upcoming lazy import system may help with startup time… but if the underlying issue wasn't "Python startup is slow" but rather "a specific program imports modules that take a long time to load", it'll only shift the time consumption to runtime.
That's totally fine, because many CLI tools are organized like `mytool subcommand --params=a,b...`, and breaking out those subcommands into their own modules and lazy loading everything (which good CLI tools already know to do) means unused code never gets imported.
You can already lazy import in python, but the new system makes the syntax sweeter and avoids having to have in-function `import module` calls, which some linters complain about.
No matter how something is phrased, someone will offer criticism. I'd guess that in roughly 20% of all HN front-page posts, at least one person comments on the terminology used. I do not see this as an argument against using accurate terminology, but rather as a reminder that it is impossible to meet everyone's expectations.
There are other terms that are similarly controversial, such as "thinking models". When you describe an LLM as "thinking", it often triggers debate because people interpret the term differently and bring their own expectations and assumptions into the discussion.
One problem is that the idea of being "well-structured" has gone overboard at some point over the past 20 years in many companies. As a result, many companies now operate highly convoluted monolithic systems that are extremely difficult to replace.
In contrast, a poorly designed microservice can be replaced much more easily. You can identify the worst-performing and most problematic microservices and replace them selectively.
> One problem is that the idea of being "well-structured" has gone overboard at some point over the past 20 years
That's exactly my experience. While a well-structured monolith is a good idea in theory, and I'm sure such examples exist in practice, that has never been the case in any of my jobs. Friends working at other companies report similar experiences.
Some people take Moore’s law in a strong sense: doubling rate is a constant. That is long dead.
But if we relax it to be a slowly varying constant, then it is not dead. That constant has been changed (by consensus) for a few times already.
Your mistake is to (1) take that constant literally (i.e. use the strong law) and (2) use the boundary points to find the “average” effect. The latter is a really flawed argument, as it cannot prove it hasn’t been dead (a recent effect), because you haven’t considered its change over time.
> It is not illegal for them to trade on their inside information. It is illegal for you to trade on their inside information. So they can send you their stock picks list and gloat about their ROI, but if you invest on the basis of the newsletter, you go to jail.
All the Congress member trades are public. There are even ETFs now that track the Congress member trades. So you can just buy such an ETF and have a one-to-one replication of the Congress member portfolio.
Food prep for the week, storing chopped vegetables, marinating small cuts of meat, storing baking ingredients like nuts, holding leftovers, portioning snacks, storing coffee beans or tea, ...
You should also buy the sealing lids with silicone gaskets.