Reddit’s API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn’t touch Reddit’s servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
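
For context, those Pushshift .zst files are newline-delimited JSON compressed with an unusually large zstd window, so they have to be stream-decoded rather than unpacked in one go. A minimal sketch of reading one in Python (the filename is a placeholder; field names follow the Reddit dump schema):

```python
import io
import json

import zstandard  # pip install zstandard


def read_dump(path):
    """Stream objects from a Pushshift .zst dump without ever
    holding the decompressed file in memory."""
    with open(path, "rb") as fh:
        # Pushshift dumps are compressed with a 2 GiB zstd window,
        # so the decompressor's window limit has to be raised.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        text = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
        for line in text:
            yield json.loads(line)


for post in read_dump("RS_2024-12.zst"):  # example filename
    print(post["subreddit"], post["title"])
```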

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.
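
I won't reproduce the full endpoint list here, and the exact routes below are hypothetical, but usage is plain HTTP against your own machine, something like:

```python
import requests

BASE = "http://localhost:5000/api"  # hypothetical base URL and route names

# Full-text search across the local archive; no request ever leaves your LAN.
resp = requests.get(f"{BASE}/search", params={"q": "raspberry pi", "limit": 10})
for post in resp.json()["results"]:
    print(post["score"], post["title"])
```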

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN
  • Tor hidden service (2 commands, no port forwarding needed; see the torrc sketch below this list)
  • VPS with HTTPS
  • GitHub Pages for small archives
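
For reference, the Tor option boils down to standard hidden-service configuration pointing at whatever local port serves the archive, which is why no port forwarding is involved. A sketch (directory and port are placeholders, not the project's exact instructions):

```
# /etc/tor/torrc (assumes the archive is already served on local port 8080)
HiddenServiceDir /var/lib/tor/redd-archiver/
HiddenServicePort 80 127.0.0.1:8080
# Restart tor, then read your .onion address from
# /var/lib/tor/redd-archiver/hostname
```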

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
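
The constant-memory property comes from streaming: rows move from the decompressed dump into PostgreSQL in fixed-size batches, so nothing grows with the dataset. A sketch of the idea with psycopg2 (table and column names are invented for illustration):

```python
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=archive")  # example connection string


def load_posts(posts, batch_size=10_000):
    """Insert an arbitrarily large stream of posts using a fixed
    amount of memory: only one batch is ever held at a time."""
    batch = []
    with conn, conn.cursor() as cur:
        for post in posts:
            batch.append((post["id"], post["subreddit"], post["title"]))
            if len(batch) >= batch_size:
                execute_values(
                    cur,
                    "INSERT INTO posts (id, subreddit, title) VALUES %s",
                    batch,
                )
                batch.clear()
        if batch:  # flush the final partial batch
            execute_values(
                cur,
                "INSERT INTO posts (id, subreddit, title) VALUES %s",
                batch,
            )
```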

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is “trust but verify” – it accelerates the boring parts but you still own the architecture.
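
The static-generation half is conceptually simple; here is a toy sketch of the Jinja2 side (template name and context are made up, not the project's real templates):

```python
from pathlib import Path

from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"), autoescape=True)
template = env.get_template("thread.html")  # hypothetical template


def render_thread(post, comments, out_dir="site"):
    """Render one thread to a standalone HTML file that needs
    no server and no JavaScript to view."""
    html = template.render(post=post, comments=comments)
    out = Path(out_dir) / post["subreddit"] / f"{post['id']}.html"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(html, encoding="utf-8")
```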

Live demo: https://online-archives.github.io/redd-archiver-example/
GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

  • 1984@lemmy.today · +1 · 2 hours ago

    I don't know if historic data is very interesting. It's the new content we are interested in…

  • a1studmuffin@aussie.zone · +30 · 12 hours ago

    This seems especially handy for anyone who wants a snapshot of Reddit from the pre-enshittification, pre-AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.

    • 19-84@lemmy.dbzer0.com (OP) · +27 · 11 hours ago

      2005-06 to 2024-12

      However, the data from 2025-12 has already been released; it just needs to be split and reprocessed for 2025 by watchful1. Once that happens, you can host an archive up to the end of 2025. I will probably add support for importing data from the Arctic Shift dumps instead, so that archives can be updated monthly.

  • breakingcups@lemmy.world · +43/-2 · 18 hours ago

    Just so you're aware, it is very noticeable that you also used AI to help write this post, and its use of language can throw a lot of people off.

    Not to detract from your project, which looks cool!

    • muusemuuse@sh.itjust.works · +4 · 8 hours ago

      You know what would be a good way to do it? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.

  • 19-84@lemmy.dbzer0.com (OP) · +13/-1 · 16 hours ago

    PLEASE SHARE ON REDDIT!!! I have never had a Reddit account and they will NOT let me post about this!!

    • Bazell@lemmy.zip · +2 · 3 hours ago (edited)

      We can't share this on Reddit, but we can share it on other platforms. Basically, what you have done is scrape tons of data for AI learning, something like "create your own AI Redditor". And greedy Reddit management will dislike it very much, even if you tell them that this is for cultural inheritance. Your work is great anyway. Sadly, I do not have enough free space to download and store all this data.

    • 19-84@lemmy.dbzer0.com (OP) · +5 · 15 hours ago

      redarc uses ReactJS to serve the web app; redd-archiver uses a hybrid architecture that combines static page generation with Postgres search via Flask. It is more like a hybrid static site generator with web-app capabilities through Docker and Flask. The static pages, with their sorted indexes, can be viewed offline and served on hosts like GitHub and Codeberg Pages.
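
      Concretely, the hybrid looks something like this (a simplified sketch, not the actual code; table and route names are illustrative):

      ```python
      from flask import Flask, jsonify, request, send_from_directory
      import psycopg2

      app = Flask(__name__)

      # Pre-rendered pages are served as plain files...
      @app.route("/")
      @app.route("/<path:page>")
      def static_page(page="index.html"):
          return send_from_directory("site", page)

      # ...and search is the one dynamic piece, backed by Postgres.
      @app.route("/api/search")
      def search():
          q = request.args.get("q", "")
          with psycopg2.connect("dbname=archive") as conn, conn.cursor() as cur:
              cur.execute(
                  "SELECT id, title FROM posts "
                  "WHERE to_tsvector('english', title) @@ plainto_tsquery(%s) "
                  "LIMIT 50",
                  (q,),
              )
              rows = cur.fetchall()
          return jsonify([{"id": r[0], "title": r[1]} for r in rows])
      ```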

  • frongt@lemmy.zip · +13 · 18 hours ago

    And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.

      • muusemuuse@sh.itjust.works · +5 · 8 hours ago

        If only I had the space and bandwidth. I would host a mirror via Lemmy and drag the traffic away.

        Actually, isn't there a way to decentralize this that can be accessed from regular browsers on the internet? Live content here, archive everywhere.

        • psycotica0@lemmy.ca · +2 · 5 hours ago

          Someone could format it into essentially static pages and publish it on IPFS. That would probably be the easiest "decentralized hosting" method that remains browsable.

  • SteveCC@lemmy.world · +10 · 17 hours ago

    Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

  • Tanis Nikana@lemmy.world · +8/-1 · 17 hours ago

    Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

    Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

    • irmadlad@lemmy.world · +10 · 11 hours ago (edited)

      Maybe read where OP says ‘Yes I used AI, English is not my first language.’ Furthermore, are ethnic slurs really necessary here?

        • El Barto@lemmy.world · +3 · 7 hours ago

          I disagree. I don't like AI slop, but he's using AI here in a way that is very much intended. Say I want to share something in Mandarin, but I don't know Mandarin. If only there were a way to transform my thoughts into Mandarin…

        • irmadlad@lemmy.world · +5 · 8 hours ago

          How many languages do you know fluently? I get that people have definite opinions about AI. Like I told another Lemmy user, I have a definite opinion about the 'arr' stack, which, conservatively, 75% of self-hosters run. However, you don't hear me out here beating my tin pan at the very mention of the 'arr' stack. Why? Because I assume you are all autonomous adults, capable of making your own decisions. Secondly, wouldn't that get a bit tedious and annoying over time? If you don't like AI, don't use it, ffs. Why castigate individuals who use AI? What does that do? I would really like to know what denigrating and browbeating users who use AI accomplishes.