garote | Updated Dreamwidth backup script

Current Mood: accomplished

Entry tags:

computing

Updated Dreamwidth backup script

For quite a while I've been looking for some nice way to get a complete backup of my Dreamwidth content onto my local machine. And I gotta wonder... Is this not a very popular thing? There are a lot of users on here, posting a lot of cool and unique content. Wouldn't they want to have a copy, just in case something goes terribly wrong?

I found a Python script that does a backup, and was patched to work with Dreamwidth, but the backup took the form of a huge pile of XML files. Thousands of them. I wanted something more flexible, so I forked the script and added an optional flag that writes everything (entries, comments, userpic info) to a single SQLite database.

https://github.com/GBirkel/ljdump

Folks on MacOS can just grab the contents of the repo and run the script. All the supporting modules should already be present in the OS. Windows people will need to install some version of Python.

For what it's worth, here's the old discussion forum for the first version of the script, released way back around 2009.

Update, 2024-03-25:

The script now also downloads and stores tag and mood information.

Update, 2024-03-26:

After synchronizing, the script now generates browseable HTML files of the journal, including entries for individual pages with comment threads, and linked history pages showing 20 entries at a time.

Moods, music, tags, and custom icons are shown for the entries where applicable.

Currently the script uses the stylesheet for my personal journal (this one), but you can drop in the styles for yours and it should accept them. The structure of the HTML is rendered as close as possible to what Dreamwidth makes.

Update, 2024-03-28:

The script can also attempt to store local copies of the images embedded in journal entries. It organizes them by month in an images folder next to all the HTML. This feature is enabled with a "--cache_images" argument.

Every time you run it, it will attempt to cache 200 more images, going from oldest to newest. It will skip over images it's already tried and failed to fetch, until 24 hours have gone by, then it will try those images once again.

The image links in your entries are left unchanged in the database. They're swapped for local links only in the generated HTML pages.

Update, 2024-04-02:

The script is now ported to Python 3, and tested on both Windows and MacOS. I've added new setup instructions for both that are a little easier to follow.

Update, 2024-04-30:

Added an option to stop the script from trying to cache images that failed to cache once already.

2024-06-26: Version 1.7.6

Attempt to fix music field parsing for some entries.
Fix for crash on missing security properties for some entries.
Image fetch timeout reduced from 5 seconds to 4 seconds.

2024-08-14: Version 1.7.7

Slightly improves unicode handling in tags and the music field.

2024-09-07: Version 1.7.8

Changes "stop at fifty" command line flag to a "max n" argument, with a default of 400, and applies it to comments as well as entries. This may help people who have thousands of comments complete their initial download. I recommend using the default at least once, then using a value of 1500 afterward until you're caught up.

2024-09-18: Version 1.7.9

Table of contents for the table of contents!
First version of an "uncached images" report to help people find broken image links in their journal.

Threaded | Top-Level Comments Only

It's not posting anything. It's using the post management interface to read posts. Apparently LJ has some kind of strict throttling that Dreamwidth doesn't have. Or perhaps it's the difference between paid and non-paid accounts??

The script output mentions items, like "L-1018", "L-1019", etc. Each of those is either an entry or a change to an entry. When you run it again, do those L numbers change? Or does it always mention the same set?

It seems my computer had restarted in the time I last ran the script, so I can't scrollback through the output.

I did check my 'entries' folder and there seem to be 2249 html files in there but the last post is 'entry-2418.html' so I'm not sure if either numbers were skipped or entries were skipped?

Is there a way to create detailed logs? I'll try to run this script from scratch again.

There are often gaps in entry ID numbers in people's journals. That's normal.

Have encoding problems with my journal :-/

I'm trying to fetch it from the livejournal.com api directly (didn't try with DW yet cause I want to try to fetch from source first)

And I get weird characters like this:
Adding new event 2 at 2004-02-24T00:44:00+00:00: ÑÐ°ÑÑÑÐºÐ¸...

Which should be this: https://vicnaum.livejournal.com/623.html

I've read that you should go to settings/OldEncoding or smth and change it to cp1251 for Cyrilic (Windows), but this page doesn't exist anymore...

Interesting, is API returning it already broken, or it can be fixed within Python still?

Hi! Thank you for your work on this script. I've run into an error while trying to back-up my journal:

Traceback (most recent call last):
  File "-\ljdump.py", line 501, in 
    ljdump(
  File "-\ljdump.py", line 177, in ljdump
    insert_or_update_event(cur, verbose, ev)
  File "-\ljdumpsqlite.py", line 417, in insert_or_update_event
    cur.execute("""
sqlite3.InterfaceError: Error binding parameter :props_taglist - probably unsupported type.

I have tried re-running the script multiple times but it always errors out and stops at the same entry with this error.

Is this dreamwidth or livejournal?
How old is the entry?
Is there more than one tag assigned to that entry?
Anything unique about those tags?

DW entry from 2009, it was imported from LJ where it had tags but I have removed all tags from it at some point (one of the original tags has a "♥" symbol in it).

Hmm... I have similar entries myself. Although perhaps my entries from LJ didn't have more than one tag, since I only started getting into tags in a big way after the DW migration. So, I wonder if perhaps this particular tags data was brought over as an array...

The other possibility is that somehow the tag is being treated as binary data because of some unicode snafu, and the field wants either null or a string...
I'm going to run a brief experiment and get back to you.

Aha. I put the tag "ճնքաքճ🥸" on my latest entry, and go the same error. :D

Okay, here's a new release with a minor change:

https://github.com/GBirkel/ljdump/releases/tag/v1.7.7

I just re-downloaded my whole journal with it, and it appears to be treating unicode in tags more correctly. :D

Thank you so much for taking the time to work on it! I'm running it now and while it's still going (got a lot of entries...) it successfully downloaded past the entry that kept stopping it.

w00t!

Apologies for coming back with another report... It seems that the script is timing out while trying to fetch comments? (I didn't even realize it would back-up those.)

Fetching journal comments for: falkner
*** Error fetching comment meta, possibly not community maintainer?
*** HTTP Error 504: Gateway Time-out
Fetching current users map from database
Traceback (most recent call last):
  File "-\ljdump.py", line 501, in 
    ljdump(
  File "-\ljdump.py", line 245, in ljdump
    usermap = get_users_map(cur, verbose)
  File "-\ljdumpsqlite.py", line 780, in get_users_map
    cur.execute("SELECT id, name FROM users_map")
sqlite3.ProgrammingError: Cannot operate on a closed cursor.

Hmm... "Gateway timeout" makes me think it's some kind of throttling happening server-side. Did this happen during a particularly long run of the script?

I run the script with -f and after downloading some entries (quickly, no problem there) it seemingly got stuck on the comments. Could it be because there's 10k+ comments on my journal? Is there a way I can finish downloading entries and exclude comment back-up?

I sat down and had a look at this... The comment fetching code is based heavily on the old ljdump code, and it appears to "catch up" with un-fetched comments all at once, no matter how many comments that could potentially be.

So, on the initial run, it will attempt to fetch the ID of every comment in the journal, and then attempt to fetch the body of every comment it gets an ID for, which could potentially be many thousands.

To allow it to fetch comments in intervals I'll need to do some re-wiring. Bear with me...

No hurry, of course. Thank you for taking the time to look into this at all!

Okay, there's a version 1.7.8 up now, that fetches comments using the same strategy as it fetches entries. It will grab 400 of each at a time by default, but you can set that to some other number, e.g. "--max 1500". Hopefully this will let you grab comments in batches small enough to appease the mighty server gods.

Unfortunately the database schema has changed very slightly from the earlier one, so you'll need to start the journal download from scratch again, instead of using whatever you've currently got.

Unfortunately I'm not sure how to fix this issue. It's possible that LJ sends the encoding information as part of the XML when one fetches an entry, e.g.:

<?xml version="1.0" encoding="WINDOWS-1251"?> ..... </xml>

and if so, that can be used to decide what encoding to use when converting it to Unicode. But right now, unless there's some magic happening in the Python XML parser I don't know about, it always assumes UTF-8 so stuff in e.g. WINDOWS-1251 will get mangled.

LJ renders it just fine when presenting its own web interface, so either LJ preserves the encoding information internally, or it follows some kind of guessing procedure to convert it to UTF-8. One could theoretically answer that question by crawling through the LJ source code.

First thanks; pulled old LJ on Mac OS and am so thankful to have it.
Second: the script skipped about ~7% of entries. Is there a way to re-run that would get it to re-try these non-pulled entries, rather than start from last pull/looking for new entries?

Unfortunately, it's out of my (the script's) control which entries LJ decides are available in the history API, so if any are missing there isn't much I can do. (Note that it's normal for there to be gaps in the ID numbers that the script appears to be fetching and that's not an indication of a skipped entry.) Do you have an error log or an example entry you can point me at?

Thank you for doing this and keeping it up!

You're quite welcome! Let me know if you hit any snags.

Also: Intrigued: What is "cat tipping"?

Oh god, I haven't updated my journal profiles in decades. I once had a cat that I liked to gently push with my foot and he'd just knock right over and then I'd pet his belly. Most cats grip the floor and refuse to tip.

Hah! Yeah, my profile is frozen in time as of about 20 years ago.
Sounds like an adorable cat. Of the three currently in my life, one is too old and distinguished to push over, one would wander sullenly away, and the other would gently begin murdering my foot. Viva variety!

Threaded | Top-Level Comments Only

Page 2 of 3

Updated Dreamwidth backup script

Re: KeyError: 'protected'

Re: KeyError: 'protected'

Re: KeyError: 'protected'

Re: <3

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

Re: <3

Thank you + Q

Re: Thank you + Q

no subject

no subject

no subject

no subject