We know that using AI can cause hallucinations, but it’s less clear how these manifest and what the pitfalls might be. Hallucinations typically arise when a large language model introduces facts learned from its training data that are unrelated to the source document. AI systems also lack emotional intelligence and human experience, which play vital roles in comprehending depth and meaning, such as judging what should be included. Without these qualities, summaries can become shallow and disconnected from their source.

A common experience is that, in paraphrasing original content, AI may produce summaries that contain words or numbers not present in the original text – which undermines both the quality and the factuality of the result.
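One practical mitigation is to run a simple faithfulness check over each summary before it is used. The Python sketch below is a hypothetical illustration (the function name and sample strings are ours, and a real check would also cover names and key terms): it flags any number that appears in a summary but not in the source text.

```python
import re

def unsupported_numbers(source: str, summary: str) -> set:
    """Return numeric figures that appear in the summary but not in the
    source text. A crude faithfulness check: anything flagged here was
    introduced during paraphrasing and should be verified by a human."""
    def numbers(text):
        return set(re.findall(r"\d+(?:\.\d+)?%?", text))
    return numbers(summary) - numbers(source)

source = "Of the 412 responses received, 38% supported the proposed route change."
summary = "Around 450 responses were received, and 38% supported the change."
print(unsupported_numbers(source, summary))  # {'450'}: a figure the source never mentions
```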

Consequently, if you need something to be 100% correct, AI summarisation isn’t the tool for the job. However, in cases where being 90% right is acceptable, these tools can be incredibly efficient and effective. The clear conclusion is that using AI when you only need to be ‘mostly right’ – such as for discovery work – is a no-brainer. For other types of summary, it is better seen as a catalyst.

We think the biggest flaw in AI summarisation is that feedback is not just about words – the way people say things is as important as what they say. Boiled down, the problem is that AI struggles with sarcasm and with statements open to multiple interpretations. Emotion is often conveyed within textual responses, for example through exclamation or question marks, and algorithmic compensation is not all that helpful here. Consider the implications if participants ‘shout’ (write in capitals) and the resulting synthesis gives extra weight to those contributions in a summary.
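Algorithms cannot reliably infer tone, but they can at least surface it for a human reviewer. The snippet below is a minimal, hypothetical sketch of such a pre-processing pass; the field names and the crude all-caps test are our own choices, not a description of any particular tool.

```python
def emphasis_signals(response: str) -> dict:
    """Surface tone cues (shouting, emphatic punctuation) in a free-text
    response so a human can judge how much weight a summary should give
    them. The three-letter cut-off for 'shouted' words is arbitrary."""
    words = response.split()
    shouted = [w for w in words if len(w) > 2 and w.isupper()]
    return {
        "shouted_words": shouted,
        "exclamation_marks": response.count("!"),
        "question_marks": response.count("?"),
    }

print(emphasis_signals("The bypass is a TERRIBLE idea!!! Why was nobody asked?"))
# {'shouted_words': ['TERRIBLE'], 'exclamation_marks': 3, 'question_marks': 1}
```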

Conversely, words can be hyper-important – for example, when technical language is used to express opinions about specialist subjects. In this instance a general-purpose AI model might also be disappointing. Lastly, the technology tends to prioritise word frequency over contextual relevance. In other words, important ideas might get overshadowed by less significant points simply because those points appear more often in the text.
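To see why frequency-driven extraction behaves this way, consider a deliberately naive sketch of the technique. This is a toy model of frequency-based extraction, not how any specific product works, and the sample feedback is invented for illustration.

```python
import re
from collections import Counter

def frequency_summary(text: str, n_sentences: int = 1) -> list:
    """Naive extractive summariser: score each sentence by how often its
    words occur across the whole text, then keep the top-scoring ones.
    Repetition, not importance, decides what survives."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))
    return sorted(sentences, key=score, reverse=True)[:n_sentences]

feedback = (
    "Parking is a problem. Parking near the school is a problem at drop-off. "
    "Parking charges are too high. The proposed crossing is unsafe for wheelchair users."
)
print(frequency_summary(feedback))
# A 'parking' sentence wins on sheer repetition; the safety concern is dropped.
```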

Understanding colloquial language and context is another major challenge. Participant feedback might include slang, references to local landmarks or people, or words that only local residents would understand. There is a similar problem with broken English, in that poor grammar or spelling ‘at source’ might result in eventual misinterpretation.

Finally, there is the problem of a reduced ability to separate fact from opinion (making AI summarisation particularly problematic for synthesising the news).

While AI models are getting better, they are only as good as their training, so the flaws we have identified at the time of writing are not necessarily permanent. There are also things that consultors can do to improve accuracy. For instance, most models achieve an accuracy of up to 95% when dealing with short sentences, but this drops to about 60% with longer sentences – so perhaps it’s a good idea to use word limits after all.
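If word limits are used, they can be enforced, or at least flagged, before responses reach a model. The sketch below is a hypothetical example; the 25-word threshold is our own and is not derived from the accuracy figures above.

```python
import re

def long_sentences(text: str, max_words: int = 25) -> list:
    """Flag sentences longer than max_words (an arbitrary illustrative
    threshold) so they can be shortened, chunked or checked by hand
    before the text is summarised."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]

answer = ("I object to the scheme because the junction is already congested at "
          "peak times and the modelling in the consultation documents appears to "
          "assume traffic levels from before the retail park opened, which seems "
          "unrealistic to me. Please reconsider.")
print(long_sentences(answer))  # only the long first sentence is flagged
```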

We haven’t even started on perceptions. People may well reject the idea of AI analysis and subsequently boycott a consultation, or demand that it is not enabled. Some people are simply worried about long-term conditioning: that an ‘over-reliance’ on this technology may leave humans unable to read and digest a ‘full version’ of events.

Ending on a more positive note, we think the benefits outweigh the disbenefits, especially if we understand and can compensate for the flaws discussed above. The remarkable capabilities of AI can transform the way we work and do helpful things beyond summarisation – such as injecting seldom-heard perspectives into a debate or facilitating consensus-making on tough policy decisions.