Kirk, Scotty, and Spock

  2026-03-22

I love reading code for education, but code reviews are different. I’ve been doing them for over a decade, and I can testify to their virtues, but they just aren’t fun. Perusing code is to code reviews as devouring a nerdy magazine article is to ploughing through Terms and Conditions. It’s all reading in the end, but the latter is somehow less enticing.

The rise of AI has had two effects on code reviews: more code to review, and more subtly broken code. The outcome is the same: even less fun. This article argues that we can counter both effects by requiring AI agents to formally prove their work.

Good summaries

So he [Grandmaster] is looking at very little and seeing quite a lot.

Josh Waitzkin, The Art of Learning

Being more efficient at code reviews means doing less work. Instead of inspecting each line of a change, we seek a succinct, accurate summary. Three commonly used summary types are names, type signatures, and doc comments.

Names are succinct but unreliable. A name might tell you everything you need to know or lead you astray. Type signatures are succinct and accurate, but most type systems are too simple to restrict implementations sufficiently. Doc comments are infinitely expressive and can tell you things you can’t deduce from the code alone, but they’re even less reliable than names.
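As a hypothetical illustration, consider a sorting function: the name, the signature, and the doc comment all make the same promise, yet the type checker happily accepts an implementation that breaks it.

```python
def sort(xs: list[int]) -> list[int]:
    """Return the elements of xs in ascending order."""
    # Type-checks and reads plausibly, yet keeps none of the promises
    # made by the name, the signature, or the doc comment.
    return xs
```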

None of these summaries is trustworthy enough to skip reading the implementation details. Luckily, there is a better kind of summary: a specification. I don’t mean that lame SPEC.md file collecting dust in your repository or that design doc of yours that became stale before it got approved—I mean honest-to-god, machine-verifiable logic.

In the vast landscape of formal methods, two tools suggest realistic modes of interaction with agents: the lean theorem prover and the spark toolchain (F* also looks promising, but I don’t have any experience with it). Both separate interface from implementation and ensure the two stay in sync.

lean represents mathematical theorems as types and their proofs as values inhabiting these types. Once you know that a proof exists, you never have to look at it. You can study it for insight, but if lean tells you that the proof has no holes, you can treat it as an opaque and reliable building block. So if you’re reviewing a contribution in lean, you can focus on the theorem and let the compiler handle all the drudgery (in my experience with proof assistants, the proofs aren’t human-friendly anyway: it’s hard to follow a proof without stepping through it interactively).
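A minimal Lean sketch of the idea (my example, not from any particular library): the statement is a type, and the compiler checks that the proof term inhabits it.

```lean
-- The statement `n + 0 = n` is a type; `rfl` is a value
-- inhabiting it, a proof the compiler checks once and for all.
theorem add_zero' (n : Nat) : n + 0 = n := rfl

-- Downstream code can rely on the theorem without ever
-- reading its proof.
example : 5 + 0 = 5 := add_zero' 5
```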

The spark ecosystem features a similar abstraction. Its module system separates interface files (.ads), which contain public declarations and contracts, from implementation files (.adb), which contain private declarations and function bodies. The GNATprove tool checks whether the bodies fulfill their contracts. When the specification is precise and the prover succeeds, the code is likely solid enough to run on a spaceship.

Even though lean and spark specifications are biased toward functional requirements (they focus on what the system does, not how; they can’t specify time complexity, for example), they are both expressive and accurate summaries.

Scotty and Spock

I object to intellect without discipline.

Spock, The Squire of Gothos

When ChatGPT became a hit, I treated it as an abomination. It combined the worst parts of humans and computers: not only was it opaque and unreliable, but it also lacked common sense. Computers say no, and they are correct. ChatGPT says you’re absolutely right! and lies to your face with the confidence of a K-Pop star.

I became more tolerant with exposure and now treat traditional systems and llms as different thinking modes (System 2 vs. System 1, in Daniel Kahneman’s terms) or different characters. A prover evokes an image of Spock, the Science Officer of the uss Enterprise, while an llm reminds me of the resourceful (and sometimes drunk) Chief Engineer Scotty. Together they can iron out details without bothering you, which gives you more time to sit in the captain’s chair and plan the next step.

Interactions between a human, an agent, and a prover form a double feedback loop: The human checks that the agent’s specification matches the intent, and the prover checks that the implementation matches the specification.

When all three of you are on board, the optimal workflow is a double feedback loop. A human expresses an intent in natural language; an agent derives a specification and an implementation of that intent. One loop has the human commenting on the specification until it matches the intent; the other has a prover commenting on the implementation until it meets the specification. Scotty gets caught in the crossfire, but, being a miracle worker, he always finds a way out.

Experiment: binary exponentiation

Beware of bugs in the above code; I have only proved it correct, not tried it.

Donald Knuth

The binary exponentiation algorithm has a special place in my heart because it helped me grok the power of loop invariants. I asked Claude Code (Opus 4.5 and 4.6) to implement it in spark. My understanding of spark is superficial: I can read it, but it would take me several hours to produce a verified version. Guided by GNATprove and my Socratic questions, Claude did it in about an hour, most of that time spent blocked on the prover. Here is its code with my annotations (all the credit is mine, all mistakes are Claude’s).

First, I decided to cheat and specify exponentiation in terms of the built-in power operator **. The fast_exp.ads file specifies the interface of the fast exponentiation package. Line 9 defines the postcondition of the Fast_Power function (line 8): it must agree with the power operator.

package Fast_Exp
  with SPARK_Mode => On
is
   type Word is mod 2**64;

   --  Fast exponentiation using binary method (exponentiation by squaring).
   --  Computes Base**Exp in O(log Exp) multiplications.
   function Fast_Power (Base : Word; Exp : Natural) return Word
     with Post => Fast_Power'Result = Base ** Exp;

end Fast_Exp;

The fast_exp.adb file implements the Fast_Power function (line 13). The code is peppered with assertions that guide the prover. Lemma_Power_Sq is a helper ghost procedure (Ghost code facilitates proofs and doesn’t appear in the executable) that suggests a proof by induction for the even-exponent case (the prover can handle most of the details, including the base case).

package body Fast_Exp
  with SPARK_Mode => On
is

   procedure Lemma_Power_Sq (B : Word; E : Natural)
     with Ghost, Pre => E >= 2 and E mod 2 = 0,
          Subprogram_Variant => (Decreases => E),
          Post => B ** E = (B * B) ** (E / 2)
   is begin
      if E > 2 then Lemma_Power_Sq (B, E - 2); end if;
   end Lemma_Power_Sq;

   function Fast_Power (Base : Word; Exp : Natural) return Word is
      Result : Word := 1;
      B      : Word := Base;
      E      : Natural := Exp;
   begin
      while E > 0 loop
         pragma Loop_Invariant (Result * B ** E = Base ** Exp);
         pragma Loop_Variant (Decreases => E);

         if E mod 2 = 1 then
            pragma Assert (B ** E = B * B ** (E - 1));
            pragma Assert (Result * B ** E = Result * (B * B ** (E - 1)));
            pragma Assert (Result * (B * B ** (E - 1)) =
                          (Result * B) * B ** (E - 1));
            Result := Result * B;
            E := E - 1;
         end if;

         exit when E = 0;

         Lemma_Power_Sq (B, E);
         B := B * B;
         E := E / 2;
      end loop;

      return Result;
   end Fast_Power;

end Fast_Exp;
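For readers who don’t speak Ada, here is a sketch of the same loop in plain Python (my illustration, not Claude’s output), with the loop invariant demoted to a runtime assertion. When E reaches zero, the invariant collapses to Result = Base**Exp, which is exactly the postcondition.

```python
def fast_power(base: int, exp: int, mod: int = 2**64) -> int:
    """Binary exponentiation, mirroring the SPARK loop above.

    Arithmetic is modulo 2**64 to match the modular type Word.
    """
    result, b, e = 1, base, exp
    while e > 0:
        # Loop invariant: Result * B**E = Base**Exp (modulo 2**64).
        assert (result * pow(b, e, mod)) % mod == pow(base, exp, mod)
        if e % 2 == 1:
            result = (result * b) % mod
            e -= 1
        else:
            b = (b * b) % mod
            e //= 2
    return result
```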

In my next prompt, I asked Claude to remove the dependency on the built-in operator and derive correctness from first principles. The new interface defines the ghost function Power (line 7) that models exponentiation for the Fast_Power function (line 14).

package Fast_Exp
  with SPARK_Mode => On
is
   type Word is mod 2**64;

   --  Ghost function for specification: computes base^exp recursively.
   function Power (Base : Word; Exp : Natural) return Word is
     (if Exp = 0 then 1 else Base * Power (Base, Exp - 1))
     with Ghost,
          Subprogram_Variant => (Decreases => Exp);

   --  Fast exponentiation using binary method (exponentiation by squaring).
   --  Computes Base^Exp in O(log Exp) multiplications.
   function Fast_Power (Base : Word; Exp : Natural) return Word
     with Post => Fast_Power'Result = Power (Base, Exp);

end Fast_Exp;

The shape of the package body stayed the same; the Lemma_Power_Sq (line 5) procedure got bulkier because the prover needed more help.

package body Fast_Exp
  with SPARK_Mode => On
is

   procedure Lemma_Power_Sq (B : Word; E : Natural)
     with Ghost, Pre => E >= 2 and E mod 2 = 0,
          Subprogram_Variant => (Decreases => E),
          Post => Power (B, E) = Power (B * B, E / 2)
   is
   begin
      if E = 2 then
         pragma Assert (Power (B, 1) = B);
         pragma Assert (Power (B * B, 1) = B * B);
      else
         Lemma_Power_Sq (B, E - 2);
         pragma Assert (Power (B, E - 1) = B * Power (B, E - 2));
         pragma Assert (Power (B, E) = B * Power (B, E - 1));
         pragma Assert (B * (B * Power (B, E - 2)) =
                       (B * B) * Power (B, E - 2));
         pragma Assert (Power (B, E) = (B * B) * Power (B, E - 2));
      end if;
   end Lemma_Power_Sq;

   function Fast_Power (Base : Word; Exp : Natural) return Word is
      Result : Word := 1;
      B      : Word := Base;
      E      : Natural := Exp;
   begin
      while E > 0 loop
         pragma Loop_Invariant (Result * Power (B, E) =
                                Power (Base, Exp));
         pragma Loop_Variant (Decreases => E);

         if E mod 2 = 1 then
            pragma Assert (Power (B, E) = B * Power (B, E - 1));
            pragma Assert (Result * (B * Power (B, E - 1)) =
                          (Result * B) * Power (B, E - 1));
            Result := Result * B;
            E := E - 1;
         end if;

         pragma Assert (Result * Power (B, E) = Power (Base, Exp));
         exit when E = 0;

         Lemma_Power_Sq (B, E);
         B := B * B;
         E := E / 2;
      end loop;

      return Result;
   end Fast_Power;

end Fast_Exp;

Lastly, I asked Claude to generalize the algorithm to work on arbitrary monoids. Claude correctly parameterized the package:

generic
   type T is private;
   Identity : T;
   with function "*" (Left, Right : T) return T;
package Fast_Exp
  with SPARK_Mode => On
is
   procedure Axiom_Assoc (A, B, C : T) with
     Ghost, Import, Always_Terminates, Global => null,
     Post => A * (B * C) = (A * B) * C;

   procedure Axiom_Id_Right (A : T) with
     Ghost, Import, Always_Terminates, Global => null,
     Post => A * Identity = A;

   procedure Axiom_Id_Left (A : T) with
     Ghost, Import, Always_Terminates, Global => null,
     Post => Identity * A = A;

   function Power (Base : T; Exp : Natural) return T is
     (if Exp = 0 then Identity else Base * Power (Base, Exp - 1))
     with Ghost,
          Subprogram_Variant => (Decreases => Exp);

   function Fast_Power (Base : T; Exp : Natural) return T
     with Post => Fast_Power'Result = Power (Base, Exp);
end Fast_Exp;

The corresponding body is close to the previous iteration, so I omitted it. The prover can’t analyze generic code; it works only with specific instances. Claude checked the code by specializing the package for 64-bit modular arithmetic.

with Fast_Exp;

package Mod64_Fast_Exp
  with SPARK_Mode => On
is
   type Word is mod 2**64;

   package Inst is new Fast_Exp
     (T => Word, Identity => 1, "*" => "*");

end Mod64_Fast_Exp;
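To see what the generalization buys, here is a hypothetical Python analogue of the generic package: instantiated with 2×2 matrix multiplication (a monoid under the identity matrix), the same loop computes Fibonacci numbers in O(log n) multiplications.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def fast_power(base: T, exp: int, identity: T, mul: Callable[[T, T], T]) -> T:
    """Binary exponentiation over any monoid (identity, mul)."""
    result, b, e = identity, base, exp
    while e > 0:
        if e % 2 == 1:
            result = mul(result, b)
            e -= 1
        else:
            b = mul(b, b)
            e //= 2
    return result

def mat_mul(x, y):
    """2x2 integer matrix multiplication."""
    return ((x[0][0] * y[0][0] + x[0][1] * y[1][0],
             x[0][0] * y[0][1] + x[0][1] * y[1][1]),
            (x[1][0] * y[0][0] + x[1][1] * y[1][0],
             x[1][0] * y[0][1] + x[1][1] * y[1][1]))

IDENTITY = ((1, 0), (0, 1))
FIB_STEP = ((1, 1), (1, 0))

def fib(n: int) -> int:
    """The n-th Fibonacci number via powers of the step matrix."""
    return fast_power(FIB_STEP, n, IDENTITY, mat_mul)[0][1]
```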

Claude impressed me, but it required more guidance and time than I expected. For example, the body of its first attempt spanned over 130 lines of code, too much for an algorithm that fits on a business card. When Claude started from scratch, adding only what was needed to help the prover, the implementation shrank to 50 lines. Another problem was that Claude tried to shift work to the prover early in the session and raised the resource limits too high, so every invocation took several minutes. Dialing down the settings boosted progress.

I consider the experiment a success: the package interface constitutes about 25% of the code (about 10% if counting omitted test code) and summarizes it perfectly. During a code review, I would read only the specification and approve the rest after a quick glance to ensure the agent didn’t cheat.

Conclusion

Ain’t technology grand—solving problems we never should have had in the first place?

Gerald M. Weinberg, Weinberg on Writing

Agents are drowning humans in subtly broken code, but they can also alleviate the problems they created if we require them to provide machine-verifiable proofs. lean and spark, though unlikely to become mainstream, serve as models for future verified toolchains that should feature a clear separation between interface and implementation (so that people know where to look) and easy-to-use, fast provers (so that agents can iterate quickly).

Researchers build specialized systems (e.g., AlphaProof and Aristotle) that rely on lean to obtain verified results. Mathematicians (most notably, Terence Tao) join forces with llms and lean and formalize proofs faster than ever before. It’s time to bridge pure math and greasy engineering filled with concurrency and networking. Consumer-grade models can handle propositional logic, but tooling, libraries (verification always works bottom up), and education lag behind.

Scotty is on board. We’re waiting for you, Mr. Spock.
