Computational Linguistics: Detecting AI-Generated Text

Last Updated on December 30, 2023 by Editorial Team

Author(s): Matteo Consoli

Originally published on Towards AI.

AI content indicators: ASL, readability, simplicity, and burstiness.

Introduction

Every time I read something on Medium or LinkedIn, I can’t stop wondering whether it was written by a human or generated by AI.
I’m developing a sixth sense to detect AI-generated content by smelling it. I might call myself a truffle dog of the third millennium.
We can all agree that my instinct is neither a scientific nor a reproducible method.

A *truffle-dog* according to AI. No complaints, it’s so cute! — playground.com

Various online tools can identify AI texts quite effectively. These tools are certainly not trained on my instinct. So, how exactly do they detect AI content?

This article introduces a few computational linguistic analyses that can help categorize text as human-written or AI-generated: average sentence length, readability, simplicity, and burstiness.

AI-Content Indicators (ACI)

In a world crowded with KPIs (Key Performance Indicators) and KRIs (Key Risk Indicators), I couldn’t miss the opportunity to define my ACIs: AI Content Indicators.
The analyses mentioned above have long been part of computational linguistics. Today, they can also play a role in estimating the probability that a text is AI-generated.

Average Sentence Length

Description: The Average Sentence Length (ASL) is the average number of words per sentence in a given text.
AI-Content-Indicator: A higher ASL may be indicative of certain text generation methods. The ASL of human writing is typically between 15 and 20 words.

ChatGPT answered with a long 57-word sentence, contributing to a high score for this parameter.

Formula & Python Code
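ASL = total number of words / total number of sentences.

The snippet below is a minimal sketch of that ratio, not necessarily the exact code used for the example above. It assumes a naive sentence splitter (a break after ".", "!" or "?" followed by whitespace), so abbreviations such as "e.g." will be miscounted. The function names are my own.

```python
import re

# Words are runs of letters, digits, and apostrophes (straight or curly).
WORD_RE = re.compile(r"[\w'’]+")


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def average_sentence_length(text: str) -> float:
    """Average number of words per sentence (0.0 for empty text)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    total_words = sum(len(WORD_RE.findall(s)) for s in sentences)
    return total_words / len(sentences)


sample = "I like truffles. My dog finds them in the woods every autumn."
print(average_sentence_length(sample))  # 3 and 9 words -> 6.0
```

For real texts, a proper sentence tokenizer would give more reliable counts than this regular expression.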

Readability

Description: Readability measures how easily text can be understood. It considers factors like sentence length, word complexity, and syllable count.
The formula I’m sharing is the Flesch reading-ease score (FRES). Don’t confuse it with the Flesch–Kincaid grade level, which ultimately has a similar objective. There are more than 200 different formulas to calculate text readability.
AI-Content-Indicator: Higher FRES values indicate straightforward text, while lower values suggest complexity. AI tends to write unnecessarily complex sentences, especially in scientific and tech domains.

The FRES score for the sentence generated by ChatGPT is around 26.
The description of readability, written at the beginning of this section, has a score of 70.

Formula & Python Code
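FRES = 206.835 - 1.015 × (total words / total sentences) - 84.6 × (total syllables / total words).

Here is a minimal sketch of that formula. The syllable counter is an assumption of mine: a rough vowel-group heuristic that is wrong for many English words, so the scores it produces will not exactly match those of a dictionary-based tool or the values quoted above.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch reading-ease score with a naive sentence splitter."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = WORD_RE.findall(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = "The cat sat on the mat."
hard = "Computational methodologies systematically evaluate linguistic complexity."
print(flesch_reading_ease(easy) > flesch_reading_ease(hard))  # True
```

Short sentences made of short words push the score up, while long, polysyllabic sentences pull it down, which is why the score can even fall below zero.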

Simplicity Score

Description: Simplicity is an umbrella term for features that analyze how “simple” a text is. Among these metrics, two are worth mentioning: lexical density and syntactic complexity.
The first is the ratio of content-carrying words (nouns, verbs, adjectives, and adverbs) to the total number of words in a text, so its value is a number between 0 and 1. A higher lexical density (close to 1) indicates greater complexity.
The second is a similar concept applied to the text structure: it is the number of clauses divided by the total number of sentences in a given text.
AI-Content-Indicator: Lower lexical density and lower syntactic complexity indicate simpler language. AI output often scores poorly on simplicity because of high lexical density and high syntactic complexity.

The average lexical density of human writing is around 0.5–0.6, while in the example above, it is close to 1.0. The syntactic complexity of the paragraph generated by ChatGPT is, in this example, within the human average of 3–5.

Formula & Python Code
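Lexical density = content words / total words. Syntactic complexity = clauses / sentences.

The sketch below covers lexical density only, because counting clauses requires a syntactic parser. It rests on a loud assumption: instead of tagging parts of speech, it treats every word that is not in a small, hand-picked list of English function words as a content word. The list (FUNCTION_WORDS) is hypothetical and far from complete, so the result is only an approximation of the definition above.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

# Hypothetical, deliberately small list of function words (assumption):
# a part-of-speech tagger would follow the definition more closely.
FUNCTION_WORDS = frozenset("""
a an the and or but if while of at by for with about against between into
through during before after above below to from up down in out on off over
under i me my we our you your he him his she her it its they them their
this that these those is am are was were be been being have has had do does
did will would shall should can could may might must not no nor so than too
""".split())


def lexical_density(text: str) -> float:
    """Share of (approximate) content words, between 0 and 1."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    content_words = [w for w in words if w not in FUNCTION_WORDS]
    return len(content_words) / len(words)


sample = "The dog finds truffles in the woods."
print(round(lexical_density(sample), 2))  # 4 content words out of 7 -> 0.57
```

Because the function-word list is short, this approximation tends to overestimate lexical density on longer texts.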

Burstiness

Description: Burstiness analysis evaluates the occurrence of specific words or sentences against their expected frequency.
AI-Content-Indicator: AI often relies on repeated language patterns and repeated syntactic structures.

In the example above, I asked ChatGPT to write a paragraph about data science with high burstiness. The terms data science, algorithms, neural networks, and innovation occur frequently, increasing the burstiness of “ML domain-specific language”.

Calculating burstiness is not as simple as the other ACIs mentioned.
The main takeaway is that human text tends to be more discontinuous, which corresponds to higher cross-sentence variation than in a text written by an AI.
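One simple way to put a number on cross-sentence variation is the coefficient of variation of sentence lengths (standard deviation divided by mean). The sketch below uses that proxy. It is an assumption of mine and a much cruder measure than the word-burstiness models in the literature. The function names are hypothetical.

```python
import re
from statistics import mean, pstdev

WORD_RE = re.compile(r"[\w'’]+")


def sentence_lengths(text: str) -> list[int]:
    """Word count of each sentence, using a naive splitter."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lengths = [len(WORD_RE.findall(s)) for s in sentences]
    return [n for n in lengths if n > 0]


def sentence_length_variation(text: str) -> float:
    """Coefficient of variation of sentence lengths (0.0 if fewer than 2 sentences)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return pstdev(lengths) / mean(lengths)


uniform = "One two three. Four five six. Seven eight nine."
varied = "Stop. The old dog wandered slowly through the quiet woods."
print(sentence_length_variation(uniform))  # 0.0
print(sentence_length_variation(varied))   # 0.8
```

Under this proxy, a text that alternates very short and very long sentences scores higher than one whose sentences all have a similar length.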

Conclusion

AI is, by all accounts, an ever-evolving domain. AI models improve day by day, generating text that mimics the statistical patterns of human writing more and more closely. One day not even my truffle-dog sixth sense will detect AI content anymore (if you skipped the intro, give it a look to catch this!).

People wondering if what they are reading is written by a human or generated by AI — playground.com

Am I against AI content?
Absolutely not. I’m not a native English speaker and I regularly use AI to make sure that my content doesn’t contain grammar errors and that the text is clear and simple. I used it even for the article you are reading right now. Nevertheless, as a reader, I prefer to dedicate my time to enjoying the writers’ flows, their unique styles, and their personal experiences (and not the pre-packaged scenarios that an AI can offer).
Is a combination of the ACIs described above enough to detect AI content?
Definitely not. Instead, what is described above could help you use AI to generate more human-friendly output (e.g. “Hey ChatGPT, talk to me about XYZ, but keep the ASL low and FRES high”).
A good AI detector might still detect it though, so don’t try to play this game for your college essays. 😆
Are the AI detectors in the market reliable?
Some are pretty good, although they might produce false positives or false negatives. Overall, they are good tools for scrutinizing Medium and LinkedIn posts, especially when the content is fully AI-generated.

Fun fact: I read about the Constitution of the United States being categorized as AI-generated by an AI detector. I strongly doubt it was written using AI back in the day. Let’s not forget to use AI detectors wisely.

George Washington using ChatGPT — playground.com

The views and opinions expressed in this article are my own and not those of any of my current, previous, or future employers.
Unless otherwise noted, all images are by the author.

Additional Resources & Bibliography

Readability, Wikipedia
Flesch–Kincaid Readability Tests, Wikipedia
Accounting for Word Burstiness in Topic Models by G. Doyle and C. Elkan, published at ICML in 2009
AI Thinks the Constitution was made by AI, New York Post (25 July 2023)
AI Content Detectors: GPTZero, Copyleaks

Published via Towards AI
