Self-host Reddit – 2.38B posts, works offline, yours forever

19-84@lemmy.dbzer0.com · 18 hours ago

Self-host Reddit – 2.38B posts, works offline, yours forever

offspec@lemmy.world · 4 hours ago

It would be neat for someone to migrate this data set to a Lemmy instance

yeehaw@lemmy.ca · 3 hours ago

Now this is a good idea.

1984@lemmy.today · 2 hours ago

I dont know if historic data is very interesting. Its the new content we are interested in…

usernameusername@sh.itjust.works · 6 hours ago

so kinda like kiwix but for reddit. That is so cool

BigDiction@lemmy.world · 6 hours ago

You should be very proud of this project!! Thank you for sharing.

a1studmuffin@aussie.zone · 12 hours ago

This seems especially handy for anyone who wants a snapshot of Reddit from pre-enshittification and AI era, where content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.

Tiger@sh.itjust.works · 12 hours ago

What is the timing of the dataset, up through which date in time?

19-84@lemmy.dbzer0.com · 11 hours ago

2005-06 to 2024-12

however the data from 2025-12 has been released already, it just needs to be split and reprocessed for 2025 by watchful1. once that happens then you can host archive up till end of 2025. i will probably add support for importing data from the arctic shift dumps instead so that archives can be updated monthly.

douglasg14b@lemmy.world · edit-2 11 hours ago

It’s literally says in the link. Go to the link and it’s the title.

breakingcups@lemmy.world · 18 hours ago

Just so you’re aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.

Not to detract from your project, which looks cool!

19-84@lemmy.dbzer0.com · 18 hours ago

Yes I used AI, English is not my first language. Thank you for the kind words!

Melvin_Ferd@lemmy.world · 10 hours ago

You’re awesome. AI is fun and there’s nothing wrong with using it especially how you did. Lemmy was hit hard with AI hate propaganda. China probably trying to stop it’s growth and development in other countries or some stupid shit like that. But you’re good. Fuck them

mustlane@lemmy.zip · edit-2 11 hours ago

Removed by mod

idealism_nearby@lemmy.world · 11 hours ago

Would love to see you learn an entire foreign language just so you are able to communicate with the world without being laughed at by people as hostile as yourself.

fartographer@lemmy.world · 3 hours ago

I can’t even learn my own language!

MadMonkey@lemmy.world · 11 hours ago

Brush, you do not seem like a nice person to be around.

Spread love and kindness, not hate.

I hope you have a better rest of your day.

Leah@piefed.blahaj.zone · 11 hours ago

Shut the fuck up loser.

irmadlad@lemmy.world · 11 hours ago

Yu mussi bawn backacow

Melvin_Ferd@lemmy.world · 10 hours ago

I fucking hate lemmy sometimes.

😈MedicPig🐷BabySaver😈@lemmy.world · 14 hours ago

Fuck Reddit and Fuck Spez.

muusemuuse@sh.itjust.works · 8 hours ago

You know what would be a good way to do t? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.

El Barto@lemmy.world · 7 hours ago

Where would it be hosted so that Conde Nast lawyers can’t touch it?

19-84@lemmy.dbzer0.com · 16 hours ago

PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!

Bazell@lemmy.zip · edit-2 3 hours ago

We can’t share this on Reddit, but we can share this on other platforms. Basically, what you have done is you scraped tons of data for AI learning. Something like “create your own AI Redditor” . And greedy Reddit management will dislike it very much even if you will tell them that this is for the cultural inheritance. Your work is great anyway. Sadly, that I do not have enough free space to load and store all this data.

El Barto@lemmy.world · edit-2 7 hours ago

Anyone doing this will be banned in that platform.

Avid Amoeba@lemmy.ca · 15 hours ago

How does this compare to redarc? It seems to be similar.

19-84@lemmy.dbzer0.com · 15 hours ago

redarc uses reactjs to serve the web app, redd-archiver uses a hybrid architecture that combines static page generation with postgres search via flask. is more like a hybrid static site generator with web app capabilities through docker and flask. the static pages with sorted indexes can be viewed offline and served on hosts like github and codeberg pages.

frongt@lemmy.zip · 18 hours ago

And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.

douglasg14b@lemmy.world · edit-2 11 hours ago

Yeah, it should inflate to 15TB or more I think

muusemuuse@sh.itjust.works · 8 hours ago

If only I had the space and bandwidth. I would host a mirror via Lemmy and drag the traffic away.

Actually, isn’t the a way to decentralize this that can be accessed from regular browsers on the internet? Live content here, archive everywhere.

psycotica0@lemmy.ca · 5 hours ago

Someone could format it into essentially static pages and publish it on IPFS. That would probably be the easiest “decentralized hosting” method that remains browsable

19-84@lemmy.dbzer0.com · 18 hours ago

Yes! Too many comments to count in a reasonable amount of time!

SteveCC@lemmy.world · 17 hours ago

Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

19-84@lemmy.dbzer0.com · 17 hours ago

thank you!!! i built on great ideas from others! i cant take all the credit 😋

Tanis Nikana@lemmy.world · 17 hours ago

Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

19-84@lemmy.dbzer0.com · 17 hours ago

the great part is that since everything is built it is easy to support any additional data! there is even an issue template to submit new data source! https://github.com/19-84/redd-archiver/blob/main/.github/ISSUE_TEMPLATE/submit-data-source.yml

Howlinghowler110th@kbin.earth · 17 hours ago

I think this is a good use case for AI and Impressed with it. wish the instructions were more clear how to set up though.

19-84@lemmy.dbzer0.com · 17 hours ago

thank you! the instruction are little overwhelming, check out the quickstart if you haven’t yet! https://github.com/19-84/redd-archiver/blob/main/QUICKSTART.md

mustlane@lemmy.zip · edit-2 11 hours ago

Removed by mod

irmadlad@lemmy.world · edit-2 11 hours ago

spoiler

Maybe read where OP says ‘Yes I used AI, English is not my first language.’ Furthermore, are ethnic slurs really necessary here?

Cybersteel@lemmy.world · 9 hours ago

Then he’s no better than Reddit who also uses AI no?

El Barto@lemmy.world · 7 hours ago

I disagree. I don’t like AI slop. But he’s using AI here in a way that is very much intended. I want to share something in Mandarin, I don’t know Mandarin. If only there was a way to transform my thoughts into Mandarin…

irmadlad@lemmy.world · 8 hours ago

How many languages do you know fluently? I get that people have a definite opinion about AI. Like I told another Lemmy user, I have a definite opinion about the ‘arr’ stack which conservatively, 75% of selfhosters run. However, you don’t hear me out here beating my tin pan at the very mention of the ‘arr’ stack. Why? Because I assume you all are autonomous adults, capable of making your own decisions. Secondly, wouldn’t that get a bit tedious and annoying over time? If you don’t like AI, don’t use it ffs. Why castigate individuals who use AI? What does that do? I would really like to know what denigrating and browbeating users who use AI accomplishes.

Self-host Reddit – 2.38B posts, works offline, yours forever

Self-host Reddit – 2.38B posts, works offline, yours forever

GitHub - 19-84/redd-archiver: A PostgreSQL-backed archive generator that creates browsable HTML archives from link aggregator platforms including Reddit, Voat, and Ruqqus.

Fuck Reddit and Fuck Spez.