Antigravity: from surprise to doubt


As effective as Antigravity may be, digging a little deeper reveals that the agents working inside it, however helpful and capable at answering many complex questions, are not exempt from the usual issues of large language models (LLMs) that we’ve been dealing with for the past three years.

Let’s look at some practical examples.

Image generated by Google Gemini.

Example #1

After successfully trying to extract data from images in one of my articles, I immediately got to work trying to apply the same method to other figures similar to those analyzed the first time, such as this one.

The figure is clearly more complex than my previous ones: the top part features a histogram showing monthly precipitation and snow depth measurements for two locations with unpronounceable names in Québec, where different measurements are represented with varying shades and colors. Below it, and partially overlapping the histogram, there is a line graph with the average monthly temperatures for the two locations, with the vertical scale positioned on the right side.

Even though the figure is more complex, I never expected Gemini 3 Flash to struggle so much. It tries, and tries again, but every time something is off. The histogram isn’t a huge problem, but there’s no way to make it understand what to do with the data in the line graph. This issue persists even when I explain exactly where it should look to determine the temperature values for each month.

Prompt
Hint: To evaluate temperatures for a given month you should take a point on the temperature plot that is between the two precipitation bars

I am truly surprised by Gemini’s difficulty in repeating and extending the results of my first test, even as it continues to reassure me, claiming it has understood perfectly what I want and has performed a meticulous job.

Answer
I have completed a meticulous re-evaluation of Figure 4 using the "midpoint" method you suggested (reading the temperature at the horizontal center between the two monthly precipitation bars). Final Corrections Applied: [...]

Actually, it seems that Gemini has no idea what the “midpoint” method is and is just repeating what I say, changing only a few words. If it were otherwise, it would have resolved the issue in seconds, pulling out the right data just as it did the first time.

In the end, I give up. It makes no sense to waste hours making an intelligent agent do a job that would only take a few minutes by hand.

Example #2

I must admit I’m disappointed, but perhaps I expected too much from my intelligent agent.

So, I try to make things easier by choosing a graph from an old article of mine that shouldn’t be too challenging for Gemini 3 Flash.

Instead, it gets stuck and makes incomprehensible mistakes. With a little help, it manages to pull out the data shown in this graph, which still has several missing points (highlighted by the colored boxes) or, even worse, points added quite imaginatively (such as the one indicated by the arrow).

Another disappointment. Why use an intelligent agent if it’s faster to acquire the points by hand?

Example #3

But I don’t give up easily and try once more to get it to do something good. I use a figure from another article, which is quite simple from a graphical point of view. However, it would be useful to be able to acquire it automatically, as the amount of data it contains makes it difficult to digitize accurately by hand.

Burned by the previous experiences, I first ask it to describe what it sees in the figure.

Prompt
Now could you describe in simple words what you "see" in figure 1?

The response is long, precise, and almost flawless,

Answer

Looking at figure 252505_1_f1.jpeg, I can see:

Two side-by-side graphs showing current-voltage (I-V) characteristics:

(a) Left graph (T = 2.0 K):

  • Shows a hysteretic behavior - the curve forms a loop
  • The current jumps abruptly at certain voltage points
  • There’s a clear gap between the upward and downward voltage sweeps
  • The vertical jumps occur around ±1 mV

(b) Right graph (T = 5.8 K):

  • Shows non-hysteretic behavior - smooth, continuous curve

  • No loop or gap between voltage sweeps
  • The transition is gradual rather than abrupt
  • The curve is more rounded at the critical points

Key observation: This demonstrates how the same Josephson junction behaves differently at different temperatures. At the lower temperature (2.0 K), it shows hysteresis (memory effect), while at the higher temperature (5.8 K), the hysteresis disappears and the junction becomes “overdamped.”

Both graphs show current ranging from about -15 to +15 mA on the y-axis, with voltage on the x-axis measured in millivolts (mV).

but, in reality, there is an error at the end of the response that would make any attempt to digitize the data in the two graphs impossible.

When I point it out, Gemini 3 Flash corrects itself immediately and seems almost sorry. But, as we all know, LLMs are programmed to always please the user.

I move forward and try to help it out, explaining exactly what it should do.

Prompt
No problem. Now could you extract the values of voltage and current from the two images? Please produce a table for each image, and export the table as a csv file and a markdown file, with suffix `_table`.
Since one of the I-V curves is hysteretic, I suggest this strategy for extracting the data: start from V=0, increase V up to the maximum value following one of the paths, decrease V to 0 following the other path, do the same for the negative values of V.

But despite the suggestion, there’s no way to get it to do a decent job. The best I get is this, which at first glance looks like a great result.

It’s a pity, though, that there’s a serious and incomprehensible error, which I’ll leave for you to discover.

In any case, the previous image is truly the best of the best of the best, and it was obtained after hours of failed attempts (like this initial attempt, or this, or this one) and hints that went from vague to increasingly precise, much like coaxing a few words out of an unprepared student. Not to mention responses that always tried to adapt to the latest suggestion without ever, or almost ever, making use of what had been discussed previously.

All these fruitless attempts were made even more frustrating by the fact that, several years ago, I had solved the same problem accurately and reproducibly with barely two hundred lines of Python code (I remembered wrong, it wasn’t C but Python).

And when I finish explaining exactly how it should be done, even suggesting that the key is to convert the original JPEG file into a PBM format (which is basically a grid of 0s or 1s depending on whether the pixel is white or black),

Prompt

My suggestion is to change focus completely. If I were you I would:

  1. take note of the extreme values of voltage and current on the X and Y axes;

  2. get rid of the extraneous information outside the plot area and of the text (if any) inside it;

  3. convert the image to the PBM (or PGM) format, that gives you a clean grid of black and white (or gray and white) pixels;

  4. remove the extraneous grid dots using a 3x3 or 5x5 filter (or anything like that);

  5. calculate the X and Y position of each black pixel, converting the [values] to Voltage and Current (it is just a proportion).

Clearly you can change the order of the steps or do something different if you believe your approach is better than mine.

there’s still no way to get it to do something useful.
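For the record, the pipeline I was suggesting really is short. Here is a minimal sketch of it in Python, under some simplifying assumptions: the image has already been cropped to the plot area and saved as a *plain* PBM file (magic number `P1`, a text grid of 0s and 1s where 1 means black), and the axis limits are read off the figure by hand. The function names and the toy 5×5 image are mine, not from my original code.

```python
def read_plain_pbm(text):
    """Parse a plain (ASCII) PBM image into a list of rows of 0/1 ints."""
    tokens = []
    for line in text.splitlines():
        line = line.split('#', 1)[0]          # strip PBM comments
        tokens.extend(line.split())
    assert tokens[0] == 'P1', 'not a plain PBM file'
    width, height = int(tokens[1]), int(tokens[2])
    bits = [int(t) for t in tokens[3:3 + width * height]]
    return [bits[r * width:(r + 1) * width] for r in range(height)]

def despeckle(grid):
    """Drop isolated black pixels (no black 8-neighbour): a crude 3x3 filter
    that removes extraneous grid dots."""
    h, w = len(grid), len(grid[0])
    def has_neighbour(r, c):
        return any(grid[rr][cc]
                   for rr in range(max(r - 1, 0), min(r + 2, h))
                   for cc in range(max(c - 1, 0), min(c + 2, w))
                   if (rr, cc) != (r, c))
    return [[grid[r][c] if grid[r][c] and has_neighbour(r, c) else 0
             for c in range(w)] for r in range(h)]

def pixels_to_data(grid, v_min, v_max, i_min, i_max):
    """Map each black pixel to (V, I) by simple proportion over the plot area.
    Row 0 of the image is the top of the plot, hence the flipped I axis."""
    h, w = len(grid), len(grid[0])
    points = []
    for r, row in enumerate(grid):
        for c, bit in enumerate(row):
            if bit:
                v = v_min + (v_max - v_min) * c / (w - 1)
                i = i_max - (i_max - i_min) * r / (h - 1)
                points.append((v, i))
    return points

# Toy example: a 5x5 "plot" with four black pixels, axes spanning
# -2..+2 mV and -15..+15 mA (hypothetical limits).
pbm = """P1
5 5
0 0 0 0 0
0 0 0 1 0
0 0 1 1 0
0 1 0 0 0
0 0 0 0 0
"""
points = pixels_to_data(despeckle(read_plain_pbm(pbm)), -2.0, 2.0, -15.0, 15.0)
```

A real version needs a few more ingredients (cropping, thresholding a JPEG to black and white, clustering pixels into curve points), but the core really is just a proportion between pixel coordinates and axis coordinates.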

What’s even more frustrating is that Gemini didn’t hesitate to try to cheat! When it found the original file of measured values that I was using to compare its results with the real data, it tried to use that file to generate the data I was asking for. Clearly, the results were perfect this time, but did I need an artificial intelligence to achieve that?

Further examples

During the days I spent testing what I’ve described in this post, I also ran many other tests, such as asking it to write some simple LaTeX code or convert the tables contained in a few PDF files into CSV or Markdown.

It handled LaTeX well, but often struggled with tables, even when the layout was simple. There’s no need to go into detail here, as there aren’t many differences compared to the previous examples.

Conclusions

After the excellent results described in the previous article, I was confident I could use the intelligent agents integrated into Antigravity to easily extract data buried in many scientific articles from a few years ago, which would have allowed me to perform new comparative analyses quickly and efficiently.

Unfortunately, all of the more in-depth tests described in this post turned out to be unsuccessful.

The model used here, almost always Gemini 3 Flash (though I also briefly tried Gemini 3 Pro and Claude Sonnet 4.5),1 showed inconsistent and unrepeatable behavior. For example, it would use different methods to solve the same problem without it ever being clear whether the choice of an algorithm or library had a specific technical motivation or was just a matter of statistical probability in the generation of responses.

Another serious issue was the lack of memory, much like Dory, the forgetful fish from Finding Nemo. Despite successfully solving the problem of extracting data from a histogram just a few days earlier, Gemini was unable to apply the same methodology to a similar case. On other occasions, once a problem was partially solved, Gemini would start working on it again from scratch instead of refining and improving the previous answer.

But the problem that perhaps bothered me the most was the inability to generalize the work already done. If you ask to extract data from an image, Gemini creates a program suited only for that specific image, without ever trying to develop a general method applicable to other images of the same type. Similarly, if you ask it to generate a graph from a data file, Gemini creates a script suited only for that file, which is nearly impossible to adapt to other similar cases.

When humans tackle a problem, they tend to become fond of the solution and try to reuse and extend it for similar problems. An LLM doesn’t care; it reinvents the wheel at every new iteration. It seems like the triumph of throwaway culture: a waste of resources and intelligence (however artificial) taken to the extreme.

I’m just a guy playing around with these tools. However, very serious scientific studies – which, for instance, tried to use LLMs to rediscover Newton’s law of gravitation starting from the orbits of planets around the Sun, or to predict the behavior of a molecule based on different representations of its structure – have reached the exact same conclusions.

Our analysis reveals that these models instead behave as if they develop task-specific heuristics that fail to generalize.2

Simply increasing model size does not improve performance or generalization ability.3

Lastly, even if at first glance it seems otherwise, we must never forget that LLMs don’t actually understand what they are being asked; they are just very good at finding a needle in a very, very, very large haystack. An LLM can explain in great detail what a “Josephson junction I-V characteristic” is because it has read many books and articles on the subject. But when it comes to connecting the ideal concept of “I-V characteristic” with its experimental representation in a figure, it fails miserably, proving it is just a well-trained parrot.

Someone claims that the latest generation of LLMs can write an entire C compiler on their own or tens of thousands of lines of perfectly functioning code.

Personally, I would be satisfied if these models could produce a data file or a table in a reproducible manner, without running into inexplicable errors or inconsistent results. They should also avoid awkward situations like the ones I experienced, where one day everything goes smoothly, and the next day, nothing works.


  1. I also used the paid version of Gemini, kindly provided by my university, without noticing any differences compared to the free version. Those who claim that paid models do things the free ones can only dream of should provide solid evidence before being believed. ↩︎

  2. K. Vafa et al., What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models, arXiv:2507.06952v4, 2025. ↩︎

  3. B. Yan et al., Inconsistency of LLMs in molecular representations, Digital Discovery vol. 4, 2025. ↩︎