The shape of the speedup
Vectorisation replaces a Python interpreter loop with a single call into a typed-array kernel. The kernel iterates in C (or in SIMD assembly underneath C); each scalar operation pays no per-iteration dispatch cost. The result is one of the largest reproducible speedups in numerical Python — 10–200× on workloads that map cleanly onto the typed-array model.
The unit of work changes. A Python loop's unit is one bytecode dispatch per element. A vectorised call's unit is one kernel invocation for the whole array. The setup cost is paid once; the per-element cost collapses by two orders of magnitude.