Recent reporting by Nieman Lab describes how some major news organizations—including The Guardian, The New York Times, and Reddit—are limiting or blocking access to their content in the Internet Archive’s Wayback Machine. As stated in the article, these organizations are blocking access largely out of concern that generative AI companies are using the Wayback Machine as a backdoor for large-scale scraping.
These concerns are understandable, but unfounded. The Wayback Machine is not intended to be a backdoor for large-scale commercial scraping and, like others on the web today, we expend significant time and effort working to prevent such abuse. Whatever legitimate concerns people may have about generative AI, libraries are not the problem, and blocking access to web archives is not the solution; doing so risks serious harm to the public record.
Knowledge rot is already a problem and has been for years: you try to follow a link only to find it’s dead, or the content was deleted. Think of the classic anecdote of finding an old thread describing your exact problem, where the only reply is “never mind, I figured it out.” Sure, archival won’t fix that specific example, but the principle stands: we lose so much information.
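Web archives at least give us a recovery path for dead links. The Internet Archive exposes a public availability API for exactly this; a minimal sketch (the endpoint and response shape are real, the helper names are mine):

```python
# Query the Wayback Machine's availability API for an archived copy
# of a (possibly dead) URL. Helper names are illustrative.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"

def availability_url(url, timestamp=None):
    """Build the availability-API query URL for a given page.
    timestamp (optional) is YYYYMMDDhhmmss, to find the closest snapshot."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"

def closest_snapshot(api_response):
    """Pull the closest archived snapshot URL out of a parsed response,
    or None if nothing is archived."""
    snap = api_response.get("archived_snapshots", {}).get("closest", {})
    return snap.get("url") if snap.get("available") else None

# Live lookup (commented out so the sketch stays offline):
# with urlopen(availability_url("example.com")) as resp:
#     print(closest_snapshot(json.load(resp)))
```

The API returns JSON like `{"archived_snapshots": {"closest": {"available": true, "url": ..., "timestamp": ...}}}`, so recovering a dead link is one GET plus one dictionary lookup.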
It would be nice if we had a government that worked for We the People and made information archival mandatory, like the Library of Congress already does with printed materials.
Precisely that one, yes :)
Yes, so there was a time when I dreamed day and night about something like those LLMs, but for archiving knowledge. That is, archiving existing statements with subjects, objects, and relations, a bit more high-level and less generalized than LLMs. Syllogisms, semantic relationships, distances in application. Sort of what holocrons are in Star Wars.
So kinda like an ethical LLM[1]. I’d be on board with that.
I know it’s unpopular to say, but I’ve found the latest version of Gemini to be pretty useful. But you have to know what they’re good for and what they’re not. General knowledge? Generally pretty decent. But you have to ask for sources and check those sources, and don’t tell it what you think; ask it what it knows, and ask it to admit when it doesn’t know things. I wouldn’t put my life on the line, but for looking up random stuff, it’s pretty decent.
I know LLMs will get worse and shittier, which I think is a bummer, because they could be so damned useful.
But I get your distinctions and I’m on board with that. It’d be nice!
It would be similar to an ethical LLM, but the question is not one of ethics; it’s one of having more structure. Granularity, sort of. That could allow us to scrape knowledge and reproduce it in some way better than raw LLM output. Such a thing could be both a model and an associative dictionary, a bit like an automated Wikipedia.
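As a toy illustration of that idea: a store of (subject, relation, object) statements, queryable by any slot, with a naive transitive syllogism check on top. Everything here is hypothetical sketch code, not a real system:

```python
# Minimal (subject, relation, object) triple store with a toy
# syllogism check. All names are illustrative.
from collections import defaultdict

class TripleStore:
    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)

    def add(self, subject, relation, obj):
        t = (subject, relation, obj)
        self.triples.add(t)
        self.by_subject[subject].add(t)

    def query(self, subject=None, relation=None, obj=None):
        """Return all triples matching the given (partial) pattern."""
        candidates = self.by_subject[subject] if subject else self.triples
        return [t for t in candidates
                if (relation is None or t[1] == relation)
                and (obj is None or t[2] == obj)]

def entails(kb, subject, obj, relation="is_a"):
    """Follow relation links transitively (a toy syllogism)."""
    seen, frontier = set(), {subject}
    while frontier:
        node = frontier.pop()
        if node == obj:
            return True
        if node in seen:
            continue
        seen.add(node)
        frontier |= {t[2] for t in kb.query(subject=node, relation=relation)}
    return False

kb = TripleStore()
kb.add("Socrates", "is_a", "human")
kb.add("human", "is_a", "mortal")
print(entails(kb, "Socrates", "mortal"))  # True: the classic syllogism
```

The point of the extra structure is visible even at this scale: the store can answer “why” (the chain of stored statements) rather than just emitting an answer, which is the granularity an LLM output lacks.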
I found it to be just Google made more convenient, which is good, but not there yet.
I know LLMs will get worse and shittier, which I think is a bummer, because they could be so damned useful.
Why would they? Humans keep producing new data. Old datasets will get less useful; they do all the time, and so will the old approach to training. But fundamentally they shouldn’t get worse.
more structure. [etc, trimmed quote]
I’m on board with wanting this :)
LLMs will get worse and shittier
Why would they?
Not from the side of them gaining more knowledge, but from the side of the companies creating them monetizing and otherwise enshittifying them.
If we had a competitive open-source LLM…
So you’re not wrong, I agree; but I was speaking of a different angle. heh
Ah, in that dimension what I see again resembles oil processing; these things are generally all similar. Better datasets, better output: a natural curve of expenses and results.
A competitive open-source LLM makes sense, but the real asset is data. So such an LLM will usually be hosted (or provided with computing power) commercially, to work on that processed data. There are no anarchist free gas stations, and just like that it will be a building block of businesses.
I suppose the real issue is paying for the servers. There’s already pushback against the datacenters needed to power LLMs as it is. I suppose the capital to build would have to come from somewhere.
It’s a pity we don’t have a good government for a project like that. That would truly be a public service.
Did some calculations recently. If we took the cropland on which we grow corn strictly for ethanol production and put solar on it, something like 5% of it (IIRC) could power enough EVs to replace ALL vehicles in the US. Which means we could use a little more land for solar to power datacenters designed to be as environmentally friendly as possible: a government-run LLM, run for the public.
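For what it’s worth, a back-of-envelope version of that calculation, using rough ballpark figures I’m assuming here (not authoritative data), lands in the same single-digit range:

```python
# Back-of-envelope check: what share of ethanol-corn cropland, covered
# in solar, could power an all-EV US fleet? All inputs are rough
# ballpark assumptions, not authoritative data.
ethanol_corn_acres = 30e6    # ~acres of US corn grown for ethanol
us_vehicle_miles   = 3.2e12  # ~annual US vehicle-miles traveled
ev_kwh_per_mile    = 0.30    # typical EV consumption
solar_kwh_per_acre = 4.0e5   # ~annual yield of utility-scale solar per acre

ev_demand_kwh = us_vehicle_miles * ev_kwh_per_mile  # ~9.6e11 kWh/year
acres_needed  = ev_demand_kwh / solar_kwh_per_acre  # ~2.4e6 acres
share = acres_needed / ethanol_corn_acres
print(f"~{share:.0%} of ethanol-corn cropland")     # ~8%
```

With these inputs it comes out around 8%, the same order as the “something like 5%” above; the exact share swings mainly on the assumed solar yield per acre.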
It’s a pipe dream because in our current reality, it could never happen. But like universal health care and a living minimum wage, it should exist.
I know, I’m straying from the topic again. ADHD gonna ADHD. heh
I suppose as long as we were able to regulate AI companies to make sure they were forced to be upfront, honest, useful… it would be a sufficient compromise. But I’m sure we can’t even have that little.
That would truly be a public service.
Well, if we continue my analogy, government-run oil processing plants and gasoline subsidies have not historically worked well.
It’s a means of investing hard power into computing.
That cropland will repurpose itself by market laws if the change is dramatic enough, and I think it is. I don’t like the AI hype, but the major change of converting hard power into data, and data into answers to questions, is potent enough. It’s not just the difference in energy volumes between ethanol and solar power; it’s also that liquid fuel is easier to store, so it’s not an equal comparison you’re making. But if energy demand is skewed enough toward grid-connected datacenters, then solar power might become economically more attractive.
I think oligopoly on data is the main threat to this. Datacenters and hosters providing power to run whatever you want with whatever data you want are not the bottleneck for competition and good evolution.
Various data harvesting farms in which users roam are.
It’s funny, I’m optimistic lately and feel like this family of technologies is slowly killing the oligopolies of previous generation. Well, not themselves, but the mechanisms that brought them into existence. Of course they too have moved on past those, but it’s sort of an improvement.
It’s been darkly amusing watching the various social media hive-minds that used to be all for the concept of “information wanting to be free” suddenly discovering that they hate AI more than they love freedom of information.
I mean, the end goal of AI is to monetize access to information while obfuscating the pre-existing free information, so there’s no real conflict there?
AI is a technology being developed and deployed by millions of people and thousands of corporations, across a huge number of countries. Users can probably be counted in the hundreds of millions now. Which ones’ “end goal” is this?
There is no conflict here, the strain of “serving” clankers denies resources to real people that actually need to access that information.
Sometimes I wonder what the prevalent response to AI would be if we lived in a better world. There are environmental and resource concerns, but if companies weren’t desperately trying to shove it everywhere to make a profit, I’m not sure those would be unmanageable.
Information still wants to be free, but the way corporations are actually using AI right now in our economic hellscape punches people much lower in the (Maslow’s) hierarchy of needs.
The core product at the bottom of this is information. People feel information should be free. Corps wanna charge for it. Govs and influentials wanna use it to push an agenda.
I kinda see it like the library. There’s info and there’s the library. Traditionally these were made open to walk in and the library system allowed you access to all the info, peacefully, privately.
Now it’s like the library has rooms, some locked. Some are owned by gobshites, some are owned by the greedy. They regularly steal the open info, change it, hide it, copy it; they hijack the librarians and paint the walls. And since they have access to the librarian, they can see what I look at.
The library is no fun no more.
I know this seems like a simple childish comparison but it’s more that for a digital/virtual world we had the choice to build spaces and systems that afforded us real progression. We didn’t. The greedy got their hands on it.
Example - If I was building a church in the virtual world what key elements does it need? Who owns it? What can it give? How does it fund itself?
I know it’s more complex than this.



