All Your AI Seems To Be Doing is Creating Data Landfill.

Jason Bell
May 29, 2023
Generated by Midjourney (for irony). Prompt by author: “Marilyn Monroe wearing a 1950s floral dress, she stands on a modern landfill site made of small images generated by artificial intelligence but the size of polaroid photographs.”

For clarity: no GPT, this is all me.

Stop the Bus, I Want to Get Off!

This is not the post I wanted to be writing, nor did I want to be the one holding the torch for whining about Artificial Intelligence. Yet, here I am. I wonder if it’s my age, or the number of years I’ve been working in software, artificial intelligence and machine learning, but every platform is now filled with so much generated crap dressed up as the “potential of AI” that I’m just bored.

This article is not about my boredom. After some thinking over the last few days, I come with a warning.

We’re Creating a Landfill of Pointless Data

While the feats of “intelligence” from OpenAI’s ChatGPT and all the other large language models (LLMs) are totally impressive, the route to getting there is a lot of “better to seek forgiveness than ask for permission”. I’ve mentioned this a few times before in other articles.

The right to be forgotten is going to be very difficult to honour on a trained model of that size. Removing someone’s data is not fine-tuning, it’s retraining the whole model, and that is costly: probably too costly to justify every time someone wants to be forgotten.

The other side to this amazing model (and it is amazing) is that we’re creating so much questionable, untrue and pointless crap that we’re creating a self-fulfilling prophecy of an incorrect data cycle for the future iterations of LLMs.

In the late eighties, during my only real formal education in computing (the rest has been good old learning and doing), we were taught all about GIGO: garbage in, garbage out. It seems we’ve forgotten this concept completely. The rise of the internet made us want to create more content, and social media enabled us to share our thoughts, our feelings and more besides. With the launch of the iPhone the world was turned on its head and we were leaving vapour trails of data: our location, even the angle of the phone in our pockets while we were sat on the toilet.

And all the while some of us knew the data would be used for profit. The issue was that the quote, “If it’s free, you are the product!”, became such a cliché that no one cared to think about it. So the data kept piling up, the companies kept rubbing their hands and no one really questioned them.

Data is Business, Business is Data

This is the title of my main key talk; I’ve delivered it several times over the last few years. Each time it needs a little amending here and there. People always comment positively about the talk, which is really nice. They’re shocked by the amount of paperwork Judith Duportail has generated via her Tinder usage, and they’re surprised when I show them what Tesco/Dunnhumby has on them from their Clubcard usage.

Thing is, the data was pretty clean, perhaps not perfect but it was good to get insight from.

The next revision feels like a rewrite, and none of it for the better. I’m just not seeing the benefits of generative AI anymore, unless you’re in marketing, in which case it’s pretty good.

NIDevConf 2022 — Data is Business, Business is Data.

I care about good data, quality data, bias in data, ethics in data. I don’t particularly care about generative data that I can’t completely quantify, or where I’m left wondering if its output probabilities are even close. If I’m training a model on utter pointless crap (UPC) then I’m wasting time, money and other people’s certainty, and therefore my reputation is in question, and so is yours.

Synthetic data is partly a solution but not the total solution. I’ve been thinking about this more, obviously, as I own a company called Synthetica Data. But I’m rooted in data quality, not data magical thinking. And that’s what the current wave of AI now seems to be.

I Don’t Care About Your Dancing Statues

I survived the whole dancing baby thing through Ally McBeal. It was painful to endure, and I still have nightmares about how awful it all was. I’ve no idea if Calista Flockhart felt the same way, but from where I was sat (which was usually folded inwards crying into my pizza) the baby ended up more famous than her. At least Peter Gabriel used plasticine and a real chicken in the Sledgehammer video.

I despair, I really do — but there are those that won’t have seen it.

I don’t really want to suffer it all again, Ally McBeal’s internal thoughts and all. This time, though, it’s on a much grander scale. If anything it feels wasteful: all that training and output time for something that’s got an internet half-life of about four seconds.

Steve Jobs talking to Socrates, no thanks.

Bill Gates talking to Gandhi, no thanks.

Dancing statues… definitely not.

The dancing baby was one thing and everyone talked about it; now we have a billion things and only one person talking about each of them. From an energy consumption point of view it’s probably worse too. Image and video model training costs a fortune and has a serious impact on the environment. The original YOLO model took days to train.

The Environmental Impact of UPC.

I’ll be honest, the environmental impact of model training is not something I’d really considered much. I’d read a few headlines in years gone by and agreed with them. If anything, I’d cut back on the way I trained models and how I used my data. I couldn’t measure my impact but felt good that I had at least attempted to do something.

Now we’re in the season of UPC (utter pointless crap) from AI, we have to look again at finding a way to measure our actions where the environment is concerned. Perhaps the question should be, “Do I really need to create this output, or do I just want to?”
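For anyone who wants a rough number to go with that question, here is a minimal back-of-the-envelope sketch in Python. It assumes the common approximation that emissions are the energy used multiplied by the carbon intensity of the electricity grid. The function name, its parameters and the figures in the usage line are all made up for illustration; they are not measurements of any real model or data centre.

```python
def estimate_emissions_kg(
    gpu_count: int,
    hours: float,
    watts_per_gpu: float,
    pue: float,
    grid_kg_co2_per_kwh: float,
) -> float:
    """Rough CO2 estimate (in kg) for a training or generation job.

    Every input is supplied by the caller; nothing here is measured.
    pue is the data centre's power usage effectiveness (1.0 = no overhead).
    """
    if min(gpu_count, hours, watts_per_gpu, grid_kg_co2_per_kwh) < 0:
        raise ValueError("inputs must not be negative")
    if pue < 1.0:
        raise ValueError("pue must be at least 1.0")
    # Energy drawn by the hardware, scaled up by the facility overhead.
    energy_kwh = gpu_count * watts_per_gpu * hours / 1000.0 * pue
    return energy_kwh * grid_kg_co2_per_kwh


# Hypothetical job: 8 GPUs at 300 W for 24 hours, PUE 1.5, 0.2 kg CO2 per kWh.
print(estimate_emissions_kg(8, 24, 300, 1.5, 0.2))
```

It is crude, and the answer is only as good as the numbers you feed it, but even a crude estimate makes the "need or want" question harder to dodge.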

Fortunately I don’t have to look too far on the environmental considerations of AI. Clarissa Bell wrote a stunning piece, “ChatGPT & Rubik’s Cubes: AI and the Carbon Puzzle”, for the University of St Andrews Unearthed Magazine.

“We could have an optimum opportunity to talk about and propose new green designs and renewable energy for a lot of commonly used technology, as well as those to train monolithic algorithms”: Clarissa Bell

The Hangover is About Due

The European Union will vote in June on its AI regulation bill. This makes companies responsible for their data usage, their references to training data and how their models are used.

Implications could be huge, the fines even larger. The first warning shot was the EU’s fine on Meta for €1.2bn over GDPR violations in its use of customer data. The same could happen under the AI bill as well.

The great hangover is coming where we have to remember that we can’t just do anything we want with AI, other people’s data and the outputs that are produced. It’s time to be a responsible citizen again.

Even when you think the old world can’t bite you on the arse/ass, it can: the US Supreme Court ruled in favour of the photographer Lynn Goldsmith over the use of her photograph by Andy Warhol. This has huge implications for the likes of Midjourney and the imagery used to train its model.

I’ve made a poster for you, you can use it wherever you need a reminder.

Better Data, Better Models = Less Data Landfill

With better data quality we get better models, that’s the reality of all of this. With generative AI we run a high risk of creating crap to train better crap. We may as well call GPT-5, if it’s ever done, UPC-5: Utterly Pointless Crap 5.

The flywheel effect of incorrect data will cause a tsunami of incorrect predictions, bad advice, incorrect models and, if we let things go too far, potential health issues from misdiagnosis.
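To put a rough shape on that flywheel, here is a toy sketch in Python. It rests on a deliberately simple assumption of my own: every time a model is retrained on the previous generation’s output, a fixed share of the still-correct records goes bad and nothing is ever repaired. The function name and the 5% error rate are hypothetical, and real systems will not behave this neatly.

```python
def correct_fraction(error_rate: float, generations: int) -> float:
    """Share of records still correct after repeated retraining on model output.

    Toy assumption: each generation corrupts `error_rate` of the records that
    were still correct, and nothing is ever repaired.
    """
    if not 0.0 <= error_rate <= 1.0 or generations < 0:
        raise ValueError("error_rate must be in [0, 1] and generations >= 0")
    return (1.0 - error_rate) ** generations


# Hypothetical 5% error rate, checked at generations 0, 5 and 10.
for generation in range(0, 11, 5):
    print(generation, round(correct_fraction(0.05, generation), 3))
```

Under that assumption, a 5% error rate leaves roughly 40% of the records wrong after ten generations. Garbage in, garbage out, with compound interest.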

It’s not the singularity that concerns me, it’s us! Some of us don’t know when to take two steps back, stop and think.

If you don’t want a data landfill of nonsense, crap and waste then stop what you are doing, get a pen and paper (yes, a pen and paper), and make plans for what it really is you want AI to do and what data you need to get you there.

I’ll make a bet with you: it won’t be from the outputs of ChatGPT or Midjourney, it’ll be a lot simpler than that.

--

Jason Bell

The Startup Quant and founder of ATXGV: Author of two machine learning books for Wiley Inc.