In-text citations: deceptively difficult for large language models?

Setting the Scene

Final draft, final check of your 60-reference review paper … and there it is: a jump from citation 16 to citation 18. You sigh, change the 18 to 17, and start renumbering every in-text citation that follows.

Then it gets worse.

You suddenly find two distinct references that are both cited as “21.” Palms sweaty, you reach for a notepad: okay – the original 21 becomes the new 20, and the second 21 stays 21 … wait, is this 21 the OG 21 or was this a 22 that I renumbered before noticing the duplicates?

If you’re an author or editor, chances are you’ve found yourself in this nightmare before.

But we live in the age of the large language model (LLM), so you might think such nightmares are a thing of the past. Simply paste your manuscript, complete with citations, into the LLM of your choice and ask the model to clean everything up, right? In the words of ChatGPT: “I’ll do the rest — just drop in the text.”

I decided to put LLMs to the test. Using a sample manuscript, complete with in-text citations and a reference list, I removed the in-text citations for 12, 14 and 17 and ensured that 21 was shared by two different papers. For good measure, I also skipped 28.
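The planted errors are mechanical enough to describe in a few lines of code. What follows is only a minimal sketch in Python, not the procedure used for these tests. It assumes in-text citations are bracketed numbers such as [12] or [3, 5-7] with plain-hyphen ranges, and that each reference-list entry is a line beginning with a number and a period. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed citation style: bracketed numbers, e.g. [12], [3, 5], [7-9].
CITATION = re.compile(r"\[(\d+(?:\s*[-,]\s*\d+)*)\]")
# Assumed reference-list style: "21. Author, Title ..."
REF_ENTRY = re.compile(r"^\s*(\d+)\.")


def cited_numbers(text):
    """Return the set of reference numbers cited anywhere in the text."""
    found = set()
    for match in CITATION.finditer(text):
        for part in match.group(1).split(","):
            part = part.strip()
            if "-" in part:
                # Expand a range such as "7-9" into 7, 8, 9.
                start, end = (int(n) for n in part.split("-"))
                found.update(range(start, end + 1))
            else:
                found.add(int(part))
    return found


def missing_citations(text, total_refs):
    """List reference numbers from 1..total_refs that are never cited."""
    return sorted(set(range(1, total_refs + 1)) - cited_numbers(text))


def duplicate_reference_numbers(reference_lines):
    """List numbers that label more than one reference-list entry."""
    counts = Counter()
    for line in reference_lines:
        match = REF_ENTRY.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return sorted(n for n, c in counts.items() if c > 1)


body = "Intro [1]. Methods [2, 4-5]. Results [7]."
refs = ["1. First paper", "2. Second paper", "2. Third paper", "3. Fourth paper"]
print(missing_citations(body, 7))
print(duplicate_reference_numbers(refs))
```

Under those assumptions, gaps like the removed 12, 14 and 17 would appear in the first list and a number shared by two papers, like 21, in the second. The sketch only reports the problems; deciding which fix is correct is a separate matter.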

The Tests

Let’s start with Ecosia. I asked its AI Chat to check my in-text citations.

Oh, I see. Well, I appreciate the honesty, Ecosia!

Okay, let’s try Google’s Gemini. After introducing Gemini to the manuscript, I asked it to fix the problems it recognized.

Wow, too easy! Let me just do a quick check and … uh-oh.

As shown above, it seems Gemini inserted in-text citation 17 where 14 should go. I pointed this out to Gemini.

Promising words! I proceeded with my check, only to find that the problematic text was … unchanged!

Furthermore, the model, in a desperate attempt to please me, added random bold formatting to in-text citations later in the manuscript.

Faith shaken, I proceeded to the famed ChatGPT. It had a strong start, clearing the area that tripped up Gemini. We made it all the way to the 20s, when …

Another slip-up. I confronted ChatGPT.

Now, to give ChatGPT credit, it did get things right in round 2. But the manuscript was very simple, artificial even, as were the mistakes… If a careful second check by a human is necessary for even the simplest of drafts, what’s the point?

AI Cannot Wake You From Your Citation Nightmare – Only a Human Can

A true late-stage manuscript, where these types of errors are most likely to occur, usually involves multiple authors. Problematically, each author typically has their own unique flair for citation formatting. Trusting such a manuscript to an LLM sounds like, if I may quote ChatGPT once more, “a mismatch made in metadata hell.”

Let’s pause and imagine I had instead worked with a skilled human editor. Rather than rush into blindly renumbering references, they might have paused to notice that my duplicate citation was redundant anyway. Or perhaps a source I had leaned heavily on in the introduction wouldn’t hold up to the scrutiny of an expert in the field.

Ultimately, in-text citations are best handled by a human expert – not just for accuracy, but also for the valuable insight that comes naturally from human collaboration.

In Summary

Even well-known AI tools like ChatGPT, Gemini or Ecosia struggle with academic in-text citations: skipped numbers, duplicates, inconsistent formatting, etc.
Large language models (LLMs) are untrustworthy even for simple manuscripts with few in-text citations.
Multi-author, late-stage manuscripts with inconsistent styling? Use LLMs at your own peril!
Even halfway through 2025, LLMs still frequently make strange changes, or hallucinate them entirely.
In academic publishing, where the stakes are high, citation mistakes can lead to delays, reviewer criticism, or even rejection.
Skilled human editors provide not just accuracy, but insight: flagging weak sources, catching logical citation gaps and maintaining consistent formatting style.