garote | Updated Dreamwidth backup script

Current Mood: accomplished

Entry tags:

computing

Updated Dreamwidth backup script

For quite a while I've been looking for some nice way to get a complete backup of my Dreamwidth content onto my local machine. And I gotta wonder... Is this not a very popular thing? There are a lot of users on here, posting a lot of cool and unique content. Wouldn't they want to have a copy, just in case something goes terribly wrong?

I found a Python script that does a backup, and was patched to work with Dreamwidth, but the backup took the form of a huge pile of XML files. Thousands of them. I wanted something more flexible, so I forked the script and added an optional flag that writes everything (entries, comments, userpic info) to a single SQLite database.

https://github.com/GBirkel/ljdump
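
If you end up with the single SQLite file and just want to see what's inside it, here is a minimal sketch that lists the tables using only Python's standard library. It doesn't assume anything about the script's actual schema; the function name `list_tables` is made up for illustration.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    # Each row is a one-element tuple holding the table name.
    return [name for (name,) in rows]
```

To try it, open the file with `sqlite3.connect(...)`, passing whatever path your backup database actually has, and hand the connection to `list_tables`. A graphical SQLite browser will show the same thing.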

Folks on MacOS can just grab the contents of the repo and run the script. All the supporting modules should already be present in the OS. Windows people will need to install some version of Python.

For what it's worth, here's the old discussion forum for the first version of the script, released way back around 2009.

Update, 2024-03-25:

The script now also downloads and stores tag and mood information.

Update, 2024-03-26:

After synchronizing, the script now generates browseable HTML files of the journal, including entries for individual pages with comment threads, and linked history pages showing 20 entries at a time.

Moods, music, tags, and custom icons are shown for the entries where applicable.

Currently the script uses the stylesheet for my personal journal (this one), but you can drop in the styles for yours and it should accept them. The structure of the HTML is rendered as close as possible to what Dreamwidth makes.

Update, 2024-03-28:

The script can also attempt to store local copies of the images embedded in journal entries. It organizes them by month in an images folder next to all the HTML. This feature is enabled with a "--cache_images" argument.

Every time you run it, it will attempt to cache 200 more images, going from oldest to newest. It will skip over images it's already tried and failed to fetch, until 24 hours have gone by, then it will try those images once again.
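
The retry rule described above can be sketched roughly like this. It's an illustration of the idea, not the script's actual code: the function name, argument names, and the `retry_failed` switch are all assumptions.

```python
from datetime import datetime, timedelta

# Wait this long before retrying an image that failed to download.
RETRY_AFTER = timedelta(hours=24)


def should_attempt(cached, last_attempted, now, retry_failed=True):
    """Decide whether to try fetching an image on this run."""
    if cached:
        return False              # already stored locally
    if last_attempted is None:
        return True               # never tried before
    if not retry_failed:
        return False              # retries switched off entirely
    # Tried and failed earlier: only retry once enough time has passed.
    return now - last_attempted >= RETRY_AFTER
```

A caller would walk the images from oldest to newest, call `should_attempt` for each one, and stop after the per-run limit (200 in the description above) is reached.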

The image links in your entries are left unchanged in the database. They're swapped for local links only in the generated HTML pages.

Update, 2024-04-02:

The script is now ported to Python 3, and tested on both Windows and MacOS. I've added new setup instructions for both that are a little easier to follow.

Update, 2024-04-30:

Added an option to stop the script from trying to cache images that failed to cache once already.

2024-06-26: Version 1.7.6

Attempt to fix music field parsing for some entries.
Fix for crash on missing security properties for some entries.
Image fetch timeout reduced from 5 seconds to 4 seconds.

2024-08-14: Version 1.7.7

Slightly improves unicode handling in tags and the music field.

2024-09-07: Version 1.7.8

Changes "stop at fifty" command line flag to a "max n" argument, with a default of 400, and applies it to comments as well as entries. This may help people who have thousands of comments complete their initial download. I recommend using the default at least once, then using a value of 1500 afterward until you're caught up.

2024-09-18: Version 1.7.9

Table of contents for the table of contents!
First version of an "uncached images" report to help people find broken image links in their journal.


Thank you!

Would you be willing to test the latest version?

I'll have to, yes.

thanks!

Seem to be running into an error where certain entries end up getting an SQLite Error, where "Error binding parameter 12: type 'Binary' is not supported." I have no idea what the issue is, but it seems to happen most commonly in entries where there are characters with diacritical marks on them in the music field. It's not all of them, specifically, but it does track pretty well. Maybe it's having Unicode parsing issues?

Dangit, I thought I ironed all those out. The music field you say! I'll poke at it and see what I find.

Thank you. It's all part of the fun when dealing with as much age and material as has gone into these venerable sites, and their quirks.

Okay, I think I might have fixed this issue in v.1.7.6 ... Give it a shot?
https://github.com/GBirkel/ljdump/releases

1.7.6 ran without errors, other than the ones I'm used to, so it pulled in the entries. That's good.

In exchange, it now seems to render the glyphs with diacritical marks as mojibake instead of the correct glyphs, and it may or may not be rendering HTML escape codes correctly, but it is definitely no longer erroring out on the posts it choked on before. The solution you implemented is at least functional.

Thank you for the script! FYI I ran into a little error on windows

line 823, in get_or_create_cached_image_record
    date_or_none = date_first_seen.strftime('%s')
ValueError: Invalid format string

Got around it by adding import time with the other imports and changing line 823 (now 824) to date_or_none = time.mktime(date_first_seen.timetuple()) (fix stolen from here, dunno if it's a good fix tho)
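
For anyone hitting the same thing: `'%s'` in `strftime` is a platform extension rather than something Python documents, which is why it fails on Windows. A portable alternative, sketched here with a made-up helper name, is `datetime.timestamp()`. Note that this returns a number rather than the string `strftime` would give, and that for a naive datetime it assumes local time, much like `time.mktime` does.

```python
from datetime import datetime, timezone


def epoch_seconds(dt):
    """Return whole seconds since the Unix epoch for dt, or None if dt is None."""
    if dt is None:
        return None
    # timestamp() works on Windows, macOS and Linux alike.
    return int(dt.timestamp())
```

For example, `epoch_seconds(datetime(1970, 1, 2, tzinfo=timezone.utc))` gives `86400`.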

EDIT: I also ended up making some more changes to download images hosted on Dreamwidth, also in their original resolution - patch file below in case it's handy.
Edit again: fix running ljdumptohtml.py alone, allow images to have attributes between <img and src="

Patch file

--- C:/dw/ljdump-1.7.4/ChangeLog	Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ChangeLog	Sat May 11 00:00:00 2024
--- C:/dw/ljdump-1.7.4/ljdump.config.sample	Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ljdump.config.sample	Sat May 11 00:00:00 2024
@@ -3,4 +3,5 @@
     <server>https://livejournal.com</server>
     <username>myaccount</username>
     <password>mypassword</password>
+    <ljuniq>ljuniq cookie if you want to download Dreamwidth hosted images</ljuniq>
 </ljdump>
--- C:/dw/ljdump-1.7.4/ljdump.py	Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ljdump.py	Sat May 11 00:00:00 2024
@@ -70,7 +70,7 @@
     return e[0].firstChild.nodeValue
 
 
-def ljdump(journal_server, username, password, journal_short_name, verbose=True, stop_at_fifty=False, make_pages=False, cache_images=False, retry_images=True):
+def ljdump(journal_server, username, password, ljuniq, journal_short_name, verbose=True, stop_at_fifty=False, make_pages=False, cache_images=False, retry_images=True):
 
     m = re.search("(.*)/interface/xmlrpc", journal_server)
     if m:
@@ -417,6 +417,7 @@
         ljdumptohtml(
             username=username,
             journal_short_name=journal_short_name,
+            ljuniq=ljuniq,
             verbose=verbose,
             cache_images=cache_images,
             retry_images=retry_images
@@ -444,6 +445,11 @@
             password = password_els[0].childNodes[0].data
         else:
             password = getpass("Password: ")
+        ljuniq_els = config.documentElement.getElementsByTagName("ljuniq")
+        if len(ljuniq_els) > 0:
+            ljuniq = ljuniq_els[0].childNodes[0].data
+        else:
+            ljuniq = getpass("ljuniq cookie (for Dreamwidth hosted image downloads, leave blank otherwise): ")
         journals = [e.childNodes[0].data for e in config.documentElement.getElementsByTagName("journal")]
         if not journals:
             journals = [username]
@@ -457,6 +463,7 @@
         print
         username = input("Username: ")
         password = getpass("Password: ")
+        ljuniq = getpass("ljuniq cookie (for Dreamwidth hosted image downloads, leave blank otherwise): ")
         print
         print("You may back up either your own journal, or a community.")
         print("If you are a community maintainer, you can back up both entries and comments.")
@@ -474,6 +481,7 @@
             journal_server=journal_server,
             username=username,
             password=password,
+            ljuniq=ljuniq,
             journal_short_name=journal,
             verbose=args.verbose,
             stop_at_fifty=args.fifty,
--- C:/dw/ljdump-1.7.4/ljdumpsqlite.py	Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ljdumpsqlite.py	Sat May 11 00:00:00 2024
@@ -30,6 +30,8 @@
 from sqlite3 import Error
 from xml.sax import saxutils
 from builtins import str
+import time
+import re
 
 
 # Subclass of tzinfo swiped mostly from dateutil
@@ -803,6 +805,10 @@
         SELECT id, url, filename, date_first_seen, date_last_attempted, cached FROM cached_images
         WHERE url = :url""", {'url': image_url})
     row = cur.fetchone()
+    pattern = re.compile('^https://(\w+).dreamwidth.org/file/\d+x\d+/(.+)')
+    if pattern.match(image_url):
+        result = pattern.search(image_url)
+        get_or_create_cached_image_record(cur, verbose, 'https://' + result.group(1) + '.dreamwidth.org/file/' + result.group(2), date_first_seen)
     if row:
         if verbose:
             print('Found image cache record for: %s' % (image_url))
@@ -820,7 +826,7 @@
             print('Creating image cache record for: %s' % (image_url))
         date_or_none = None
         if date_first_seen:
-            date_or_none = date_first_seen.strftime('%s')
+            date_or_none = time.mktime(date_first_seen.timetuple())
         data = {
             "id": None,
             "url": image_url,
--- C:/dw/ljdump-1.7.4/ljdumptohtml.py	Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ljdumptohtml.py	Mon May 13 23:37:54 2024
@@ -385,12 +385,18 @@
 
 def resolve_cached_image_references(content, image_urls_to_filenames):
     # Find any image URLs
-    urls_found = re.findall(r'img[^\"\'()<>]*\ssrc\s?=\s?[\'\"](https?:/+[^\s\"\'()<>]+)[\'\"]', content, flags=re.IGNORECASE)
+    urls_found = re.findall(r'<img[^<>]*\ssrc\s?=\s?[\'\"](https?:/+[^\s\"\'()<>]+)[\'\"]', content, flags=re.IGNORECASE)
     # Find the set of URLs that have been resolved to local files
     resolved_urls = []
     for image_url in urls_found:
         if image_url in image_urls_to_filenames:
             resolved_urls.append(image_url)
+        pattern = re.compile('^https://(\w+).dreamwidth.org/file/\d+x\d+/(.+)')
+        if pattern.match(image_url):
+            result = pattern.search(image_url)
+            full_url = 'https://' + result.group(1) + '.dreamwidth.org/file/' + result.group(2)
+            if full_url in image_urls_to_filenames:
+                resolved_urls.append(full_url)
     # Swap them in
     for image_url in resolved_urls:
         filename = image_urls_to_filenames[image_url]
@@ -630,10 +636,15 @@
     return html_as_string
 
 
-def download_entry_image(img_url, journal_short_name, subfolder, url_id):
+def download_entry_image(img_url, journal_short_name, subfolder, url_id, entry_url, ljuniq):
     try:
-        image_req = urllib.request.urlopen(img_url, timeout = 5)
+        pattern = re.compile('^https://(\w+).dreamwidth.org/file/')
+        headers = {}
+        if pattern.match(img_url):
+            headers = {'Referer': entry_url, 'Cookie': "ljuniq="+ljuniq}
+        image_req = urllib.request.urlopen(urllib.request.Request(img_url, headers = headers), timeout = 5)
         if image_req.headers.get_content_maintype() != 'image':
+            print('Content type was not expected, image skipped: ', img_url, image_req.headers.get_content_maintype())
             return (1, None)
         extension = MimeExtensions.get(image_req.info()["Content-Type"], "")
 
@@ -681,7 +692,7 @@
         return (1, None)
 
 
-def ljdumptohtml(username, journal_short_name, verbose=True, cache_images=True, retry_images=True):
+def ljdumptohtml(username, journal_short_name, ljuniq, verbose=True, cache_images=True, retry_images=True):
     if verbose:
         print("Starting conversion for: %s" % journal_short_name)
 
@@ -741,8 +752,17 @@
                 e_id = entry['itemid']
                 entry_date = datetime.utcfromtimestamp(entry['eventtime_unix'])
                 entry_body = entry['event']
-                urls_found = re.findall(r'img[^\"\'()<>]*\ssrc\s?=\s?[\'\"](https?:/+[^\s\"\'()<>]+)[\'\"]', entry_body, flags=re.IGNORECASE)
+                initial_urls_found = re.findall(r'<img[^<>]*\ssrc\s?=\s?[\'\"](https?:/+[^\s\"\'()<>]+)[\'\"]', entry_body, flags=re.IGNORECASE)
                 subfolder = entry_date.strftime("%Y-%m")
+                urls_found = []
+                for image_url in initial_urls_found:
+                    urls_found.append(image_url)
+                    pattern = re.compile('^https://(\w+).dreamwidth.org/file/\d+x\d+/(.+)')
+                    if pattern.match(image_url):
+                        result = pattern.search(image_url)
+                        full_url = 'https://' + result.group(1) + '.dreamwidth.org/file/' + result.group(2)
+                        urls_found.append(full_url)
+
                 for image_url in urls_found:
                     cached_image = get_or_create_cached_image_record(cur, verbose, image_url, entry_date)
                     try_cache = True
@@ -758,7 +778,7 @@
                         image_id = cached_image['id']
                         cache_result = 0
                         img_filename = None
-                        (cache_result, img_filename) = download_entry_image(image_url, journal_short_name, subfolder, image_id)
+                        (cache_result, img_filename) = download_entry_image(image_url, journal_short_name, subfolder, image_id, entry['url'], ljuniq)
                         if (cache_result == 0) and (img_filename is not None):
                             report_image_as_cached(cur, verbose, image_id, img_filename, entry_date)
                             image_resolve_max -= 1
@@ -955,6 +975,11 @@
         journals = [e.childNodes[0].data for e in config.documentElement.getElementsByTagName("journal")]
         if not journals:
             journals = [username]
+        ljuniq_els = config.documentElement.getElementsByTagName("ljuniq")
+        if len(ljuniq_els) > 0:
+            ljuniq = ljuniq_els[0].childNodes[0].data
+        else:
+            ljuniq = getpass("ljuniq cookie (for Dreamwidth hosted image downloads, leave blank otherwise): ")
     else:
         print("ljdumptohtml - livejournal (or Dreamwidth, etc) archive to html utility")
         print
@@ -976,6 +1001,7 @@
         ljdumptohtml(
             username=username,
             journal_short_name=journal,
+            ljuniq=ljuniq,
             verbose=args.verbose,
             cache_images=args.cache_images,
             retry_images=args.retry_images
--- C:/dw/ljdump-1.7.4/README.md	Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/README.md	Sat May 11 00:00:00 2024
--- C:/dw/ljdump-1.7.4/stylesheet.css	Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/stylesheet.css	Sat May 11 00:00:00 2024

Edited (Added patch for Dreamwidth hosted images) 2024-05-18 04:21 (UTC)
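
The thumbnail-to-original URL rewrite that the patch repeats in three places could be pulled into one helper. This is a sketch, not part of the patch: the name `original_image_url` is invented, and it differs from the patch by using a raw string, escaping the dots, and allowing hyphens in the subdomain.

```python
import re

# Matches Dreamwidth-hosted thumbnails such as
# https://user.dreamwidth.org/file/100x100/1234.jpg
THUMBNAIL_URL = re.compile(
    r'^https://([\w-]+)\.dreamwidth\.org/file/\d+x\d+/(.+)$'
)


def original_image_url(image_url):
    """Return the full-resolution URL for a thumbnail URL, else None."""
    m = THUMBNAIL_URL.match(image_url)
    if not m:
        return None
    # Drop the WIDTHxHEIGHT path segment to get the original file.
    return 'https://%s.dreamwidth.org/file/%s' % (m.group(1), m.group(2))
```

With the pattern in one place, the escaping only has to be right once.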

I've made a new release incorporating a slightly rewritten version of your changes... But I'm not sure how the ljuniq thing works. Where do we get that value?

https://github.com/GBirkel/ljdump/releases/tag/v1.7.5

Thank you!

I get the ljuniq cookie by opening a new private browsing window, opening the network request section of the browser window, going to https://www.dreamwidth.org/ then solving the CAPTCHA. It redirects back to the homepage which returns a set-cookie HTTP header shown in the developer tools, the line looks something like

set-cookie
	ljuniq=ewp0jLxxxx97IQp%3A171xxxx687; domain=.dreamwidth.org; path=/; expires=Sat, 20 Jul 2024 22:31:27 GMT; secure

(I replaced parts of it with "x" in that since it's just an example.)

The part that looks like ewp0jLxxxx97IQp%3A171xxxx687 is the important part, I copy that and paste it as the <ljuniq> value in the configuration file. Note the %3A in the cookie I think needs to be changed to a : (since it's "URL encoded")
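
If you'd rather not do the `%3A` swap by hand, Python's standard library can undo URL encoding. A tiny sketch follows; the helper name is made up, and whether Dreamwidth really needs the decoded form is my guess, as noted above.

```python
from urllib.parse import unquote


def decode_cookie_value(raw):
    """Turn a URL-encoded cookie value into its plain form."""
    # unquote() replaces %XX escapes, e.g. %3A becomes ":".
    return unquote(raw.strip())
```

Running it on the example value above turns the `%3A` into a colon and leaves the rest untouched.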

So the end configuration option looks like

<ljuniq>ewp0jLxxxx97IQp:171xxxx687</ljuniq>

That cookie tends to expire really easily in my experience (or maybe it's my user error, I'm bad at doing things right). I'm not sure what the conditions are, but it sometimes takes me a few tries to get it to let me download the images. If it fails, it'll say "Content type text not expected, image skipped" for the image that didn't work. I've resorted to commenting out the logic at the end of the "Respect the global image cache setting" bit in ljdumptohtml.py and then running only ljdumptohtml.py, to retry the images that failed without waiting 24h or deleting the cache rows from the database manually.

Unrelated if it helps, found another crash but I'm not sure how to fix it:

The lines

print('Adding new event %s at %s: %s' % (data['itemid'], data['eventtime'], data['subject']))

and

print('Updating event %s at %s: %s' % (data['itemid'], data['eventtime'], data['subject']))

can crash with errors like UnicodeEncodeError: 'cp932' codec can't encode character '\xe2' in position 54: illegal multibyte sequence

I don't know how to fix that, but removing the printing of data['subject'] works around it.
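
One possible workaround that keeps the subject in the output is to replace whatever the console can't display before printing. This is only a sketch, and `console_safe` is an invented name. As a side note, the `'\xe2'` in the error is the first byte of a UTF-8 curly apostrophe, which hints that the subject may already have been mis-decoded before it got to `print`.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the console can't encode replaced by '?'."""
    encoding = encoding or sys.stdout.encoding or 'ascii'
    # Round-trip through the console encoding, substituting '?' for
    # anything it cannot represent instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors='replace').decode(encoding)
```

The print lines would then wrap the subject, e.g. `console_safe(data['subject'])`, so the output is lossy but the run doesn't crash.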

One possible note: since it uses the original image rather than also downloading the thumbnail, the image shown in the rendered HTML page can be very large and run off the page if the original picture is high resolution but had a small thumbnail in the journal entry. Since that differs from the patch, I'm guessing it may be a deliberate decision though.

If I read it right, I think this line sends the cookie unconditionally, even for non-Dreamwidth images. I don't think that follows the cookie security a browser would have, where the cookie is private information shared only with the original host. Leaking it to other hosts may be a security consideration, though I don't know how DW uses that cookie.
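
A sketch of what a host check could look like, so the cookie only goes to Dreamwidth itself. The function name is hypothetical; the header names and the `ljuniq=` format are taken from the patch above.

```python
from urllib.parse import urlsplit


def headers_for_image(img_url, entry_url, ljuniq):
    """Build request headers, attaching the cookie only for Dreamwidth hosts."""
    host = (urlsplit(img_url).hostname or '').lower()
    # Exact match or a true subdomain; "dreamwidth.org.example.com" fails.
    is_dreamwidth = host == 'dreamwidth.org' or host.endswith('.dreamwidth.org')
    if is_dreamwidth and ljuniq:
        return {'Referer': entry_url, 'Cookie': 'ljuniq=' + ljuniq}
    return {}
```

Checking the parsed hostname rather than searching the whole URL avoids being fooled by "dreamwidth.org" appearing in a path or in another domain's name.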

Thank you again for the script, and sorry to mention a lot of things!

Edited (added notes about thumbnail and cookie may be shared to other hosts) 2024-05-21 23:57 (UTC)

Thanks for your work on this!

Yeah I think you're right about the encoding error... Something in the subject line. I currently treat the entry content carefully with respect to encoding conversion because it can be from all kinds of origins, but other short data fields like "music" and "subject" are handled a little more simply, which needs to change...

One thing that makes it especially difficult is Dreamwidth does some of its own internal conversion when rendering a journal, so diagnosing the problem isn't as simple as going to the entry on the Dreamwidth site and having a look...

What's the subject line that causes the crash? I know pasting it in here will probably convert it into unicode, but maybe it will give me a clue...

Extracting that cookie information is a complicated process. :D Instead of re-authenticating with a Captcha, what happens when you use the ljuniq cookie that's stored in the browser when you're already authenticated? Like, right now I can go to Developer Window->Storage->Cookies in Safari and see an ljuniq cookie... Could I just paste that into the script?

Sorry I can't test this myself... I don't have any images hosted on DW...

Sorry it took me forever to write back!

The crashing entry title was "I haven’t been on reddit in ages, I got so much from it but honestlly was ready to move on ig". It didn't crash on linux, only on windows. I suspect it's the unicode quotation mark in "haven’t" breaking it, but haven't confirmed that.

Yes, when I used the very latest ljuniq cookie value from the last request in the developer tools from my normal logged in window, it seemed to work, so I wonder if it's just that I was grabbing ones that were too old before.

Thanks again for the useful tool and your replies!

Have encoding problems with my journal :-/

I'm trying to fetch it from the livejournal.com api directly (didn't try with DW yet cause I want to try to fetch from source first)

And I get weird characters like this:
Adding new event 2 at 2004-02-24T00:44:00+00:00: ÑÐ°ÑÑÑÐºÐ¸...

Which should be this: https://vicnaum.livejournal.com/623.html

I've read that you should go to settings/OldEncoding or something like it and change it to cp1251 for Cyrillic (Windows), but this page doesn't exist anymore...

Interesting: is the API returning it already broken, or can it still be fixed within Python?

Unfortunately I'm not sure how to fix this issue. It's possible that LJ sends the encoding information as part of the XML when one fetches an entry, e.g.:

<?xml version="1.0" encoding="WINDOWS-1251"?> ..... </xml>

and if so, that can be used to decide what encoding to use when converting it to Unicode. But right now, unless there's some magic happening in the Python XML parser I don't know about, it always assumes UTF-8 so stuff in e.g. WINDOWS-1251 will get mangled.

LJ renders it just fine when presenting its own web interface, so either LJ preserves the encoding information internally, or it follows some kind of guessing procedure to convert it to UTF-8. One could theoretically answer that question by crawling through the LJ source code.
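
One observation that may help: runs of "Ñ" and "Ð" are what UTF-8 encoded Cyrillic looks like when the bytes are decoded as Latin-1, so the data may be valid UTF-8 that got mis-decoded somewhere along the way. If so, and if the bytes survive intact, the damage is reversible. Here is a sketch with a made-up helper name; it won't work on the sample as pasted above, since some non-printable characters were lost in the pasting.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them the right way.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already correct): leave it alone.
        return text
```

Text that was never garbled falls into the `except` branch and comes back unchanged, so the function is reasonably safe to apply blindly to short fields.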

Hi there! Thank you for reviving ljdump. I saw that there have been updates over the past number of months so I'm very appreciative of your efforts.

I've just started attempting to back up my old LJ but have come up against this error, and I'm not sure what caused it. Maybe it's because I have Private posts? It's been a long time since I've used LJ so I've probably forgotten what privacy categories exist. Here's the error, it appears after adding 'moods':

Adding new mood with name: jealous
Adding new mood with name: nervous
Traceback (most recent call last):
  File "C:\Users\Anthony\Downloads\ljdump-1.7.5\ljdump.py", line 488, in <module>
    ljdump(
  File "C:\Users\Anthony\Downloads\ljdump-1.7.5\ljdump.py", line 348, in ljdump
    'security_protected': t['security']['protected'],
KeyError: 'protected'

Interesting! Well if there's no key there, then there's no 'protected' status to report, true or false...
So it may not be about the entry being protected, it may be about the data coming from the server just being weirdly incomplete. Maybe some old entry before a schema change, or some weird side-effect of an import.
Thanks for the info! I'll see if I can cook up a patch for this later today.
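
The general shape of a fix for a missing key is to read it with `dict.get`, so an absent value becomes `None` instead of a crash. This is a sketch of that idea only, not necessarily what the patch does; the helper name is invented, and only the 'protected' key from the traceback is shown.

```python
def security_fields(t):
    """Extract security info from a record, tolerating missing keys."""
    # The 'security' entry itself may be missing or empty.
    security = t.get('security') or {}
    return {
        # None means the server reported no 'protected' status at all.
        'security_protected': security.get('protected'),
    }
```

Storing `None` keeps "not reported" distinct from an explicit true or false.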

Any luck with the script? :)

Okay, I think I might have fixed this issue in v.1.7.6 ... Give it a shot?
https://github.com/GBirkel/ljdump/releases

Great! Thank you! It seems to be fine after a few runs of the script, though it looks like I've hit a limit and will need to wait an hour! Seems that it happened after around 1000 posts. Not sure if there's anything you can do about that, but just so you know! So far I have 2004-Feb to 2006-Mar of my LJ and... still have up to 2010 to get to. 😁😂

Error getting item: L-1018
<Fault 404: "Client error: Cannot post: You've exceeded a posting limit and will be able to continue posting within an hour.">
Fetching journal entry L-1019 (create)
Error getting item: L-1019
<Fault 404: "Client error: Cannot post: You've exceeded a posting limit and will be able to continue posting within an hour.">
Fetching journal entry L-1020 (create)
Error getting item: L-1020
<Fault 404: "Client error: Cannot post: You've exceeded a posting limit and will be able to continue posting within an hour.">

Huh! Interesting. Another error I haven't encountered... Hopefully it will get through the whole set eventually.

I'm a bit unsure of how this works, what does the script need to post? I've also had to run it a few times as it only seems to grab around 250-300 posts at a time before I need to re-run it. And I'm not sure it is starting from where it previously ended.

Is there a way to turn on logs for you to look at further?

It's not posting anything. It's using the post management interface to read posts. Apparently LJ has some kind of strict throttling that Dreamwidth doesn't have. Or perhaps it's the difference between paid and non-paid accounts??

The script output mentions items, like "L-1018", "L-1019", etc. Each of those is either an entry or a change to an entry. When you run it again, do those L numbers change? Or does it always mention the same set?

It seems my computer had restarted in the time I last ran the script, so I can't scrollback through the output.

I did check my 'entries' folder and there seem to be 2249 html files in there but the last post is 'entry-2418.html' so I'm not sure if either numbers were skipped or entries were skipped?

Is there a way to create detailed logs? I'll try to run this script from scratch again.

There are often gaps in entry ID numbers in people's journals. That's normal.

Hi! Thank you so much for creating this. I am very much not a programmer, but I ran the script several times and have seen that the log shows all of my entries are archived. However, I am getting an error, and I don't see any HTML files, just a .db file that I don't know how to open.

This is the last bit of output:

Traceback (most recent call last):
  File "./ljdump.py", line 501, in <module>
    ljdump(
  File "./ljdump.py", line 430, in ljdump
    ljdumptohtml(
  File "/Users/brandie 1/Downloads/ljdump-1.7.6/ljdumptohtml.py", line 765, in ljdumptohtml
    cached_image = get_or_create_cached_image_record(cur, verbose, url_to_cache, entry_date)
  File "/Users/brandie 1/Downloads/ljdump-1.7.6/ljdumpsqlite.py", line 832, in get_or_create_cached_image_record
    cur.execute("""
sqlite3.OperationalError: near "RETURNING": syntax error

Huh, that's an interesting one. An SQL language error. If I'm reading this right, the version of SQLite you're using doesn't support the "RETURNING" clause, which was introduced in version 3.35 (2021-03-12). Tell me some details about the computer you're running this on?
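
A quick way to check which SQLite library your Python is linked against is sketched below. `sqlite3.sqlite_version_info` is part of the standard library; the helper name `supports_returning` is just for illustration.

```python
import sqlite3


def supports_returning(version_info=sqlite3.sqlite_version_info):
    """True if this SQLite version understands the RETURNING clause."""
    # RETURNING arrived in SQLite 3.35.0.
    return tuple(version_info) >= (3, 35, 0)
```

Running `python3 -c "import sqlite3; print(sqlite3.sqlite_version)"` in a terminal prints the version directly.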

It's an old one!! It's an iMac from late 2015 running Catalina (10.15.7). I think it's possible to update the OS to Monterey but I was worried about performance issues.

Well, I can think of one thing to try, but it depends on your level of comfort with command-line stuff. You can install homebrew, and then use it to install a more up-to-date python3, e.g. "brew install python3"...

Hi! Thank you for your work on this script. I've run into an error while trying to back-up my journal:

Traceback (most recent call last):
  File "-\ljdump.py", line 501, in <module>
    ljdump(
  File "-\ljdump.py", line 177, in ljdump
    insert_or_update_event(cur, verbose, ev)
  File "-\ljdumpsqlite.py", line 417, in insert_or_update_event
    cur.execute("""
sqlite3.InterfaceError: Error binding parameter :props_taglist - probably unsupported type.

I have tried re-running the script multiple times but it always errors out and stops at the same entry with this error.

Is this dreamwidth or livejournal?
How old is the entry?
Is there more than one tag assigned to that entry?
Anything unique about those tags?

DW entry from 2009, it was imported from LJ where it had tags but I have removed all tags from it at some point (one of the original tags has a "♥" symbol in it).

Hmm... I have similar entries myself. Although perhaps my entries from LJ didn't have more than one tag, since I only started getting into tags in a big way after the DW migration. So, I wonder if perhaps this particular tags data was brought over as an array...

The other possibility is that somehow the tag is being treated as binary data because of some unicode snafu, and the field wants either null or a string...
I'm going to run a brief experiment and get back to you.
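
If the second guess is right, the value probably arrives as an `xmlrpc.client.Binary` wrapper (an earlier error in this thread named the type 'Binary'), which SQLite can't bind. A sketch of normalizing such values before the insert follows; the helper name is made up, and assuming UTF-8 for the bytes is exactly that, an assumption.

```python
import xmlrpc.client


def to_sqlite_value(value):
    """Convert XML-RPC Binary or bytes values into str for SQLite binding."""
    if isinstance(value, xmlrpc.client.Binary):
        value = value.data          # unwrap to raw bytes
    if isinstance(value, bytes):
        # Assumes the server sent UTF-8; bad bytes become U+FFFD.
        return value.decode('utf-8', errors='replace')
    return value                    # str, int, None etc. pass through
```

Applying this to every short field (tags, music, subject) before binding would cover all of them in one place.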

Aha. I put the tag "ճնքաքճ🥸" on my latest entry, and got the same error. :D

Okay, here's a new release with a minor change:

https://github.com/GBirkel/ljdump/releases/tag/v1.7.7

I just re-downloaded my whole journal with it, and it appears to be treating unicode in tags more correctly. :D

Thank you so much for taking the time to work on it! I'm running it now and while it's still going (got a lot of entries...) it successfully downloaded past the entry that kept stopping it.

w00t!

Apologies for coming back with another report... It seems that the script is timing out while trying to fetch comments? (I didn't even realize it would back-up those.)

Fetching journal comments for: falkner
*** Error fetching comment meta, possibly not community maintainer?
*** HTTP Error 504: Gateway Time-out
Fetching current users map from database
Traceback (most recent call last):
  File "-\ljdump.py", line 501, in <module>
    ljdump(
  File "-\ljdump.py", line 245, in ljdump
    usermap = get_users_map(cur, verbose)
  File "-\ljdumpsqlite.py", line 780, in get_users_map
    cur.execute("SELECT id, name FROM users_map")
sqlite3.ProgrammingError: Cannot operate on a closed cursor.

Hmm... "Gateway timeout" makes me think it's some kind of throttling happening server-side. Did this happen during a particularly long run of the script?

I ran the script with -f and after downloading some entries (quickly, no problem there) it seemingly got stuck on the comments. Could it be because there are 10k+ comments on my journal? Is there a way I can finish downloading entries and exclude comment back-up?

I sat down and had a look at this... The comment fetching code is based heavily on the old ljdump code, and it appears to "catch up" with un-fetched comments all at once, no matter how many comments that could potentially be.

So, on the initial run, it will attempt to fetch the ID of every comment in the journal, and then attempt to fetch the body of every comment it gets an ID for, which could potentially be many thousands.

To allow it to fetch comments in intervals I'll need to do some re-wiring. Bear with me...

No hurry, of course. Thank you for taking the time to look into this at all!

Okay, there's a version 1.7.8 up now, that fetches comments using the same strategy as it fetches entries. It will grab 400 of each at a time by default, but you can set that to some other number, e.g. "--max 1500". Hopefully this will let you grab comments in batches small enough to appease the mighty server gods.

Unfortunately the database schema has changed very slightly from the earlier one, so you'll need to start the journal download from scratch again, instead of using whatever you've currently got.
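
The batching idea, roughly sketched: a "--max" argument with a default of 400 caps how many pending items one run will fetch. The `--max` flag and its default come from the release notes; the function names and the `max_items` destination are assumptions, not the script's real internals.

```python
import argparse


def parse_args(argv=None):
    """Parse a batch-size option shaped like the script's --max argument."""
    parser = argparse.ArgumentParser(description="Batch size sketch")
    parser.add_argument("--max", dest="max_items", type=int, default=400,
                        help="maximum items to fetch in one run")
    return parser.parse_args(argv)


def take_batch(pending_ids, max_items):
    """Return only the first max_items ids; the rest wait for the next run."""
    return pending_ids[:max_items]
```

Each run then handles one slice of the backlog, and repeated runs work through the rest.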

First thanks; pulled old LJ on Mac OS and am so thankful to have it.
Second: the script skipped about 7% of entries. Is there a way to re-run that would get it to re-try these non-pulled entries, rather than start from last pull/looking for new entries?

Unfortunately, it's out of my (the script's) control which entries LJ decides are available in the history API, so if any are missing there isn't much I can do. (Note that it's normal for there to be gaps in the ID numbers that the script appears to be fetching and that's not an indication of a skipped entry.) Do you have an error log or an example entry you can point me at?

Thank you for doing this and keeping it up!

You're quite welcome! Let me know if you hit any snags.

Also: Intrigued: What is "cat tipping"?

Oh god, I haven't updated my journal profiles in decades. I once had a cat that I liked to gently push with my foot and he'd just knock right over and then I'd pet his belly. Most cats grip the floor and refuse to tip.

Hah! Yeah, my profile is frozen in time as of about 20 years ago.
Sounds like an adorable cat. Of the three currently in my life, one is too old and distinguished to push over, one would wander sullenly away, and the other would gently begin murdering my foot. Viva variety!

I miss cats, haven't had any in ages.

Question: If I edit an old post that I have already downloaded, does the program go back and make that update? I never finished tagging old posts and it's a work in progress.

Yes. The Livejournal-inspired software makes an “event log” that includes events like “edited an older entry”, and the script works by catching up with the event log.

I post a lot of backdated stuff so it’s a pretty important feature for me. :D
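Here is a minimal sketch of the "catch up with the event log" idea, for illustration only. The data shapes are assumptions: the log is modelled as (timestamp, entry_id) pairs, and fetch_entry is a hypothetical callable standing in for a server request. The real script and server API will differ in the details.

```python
def catch_up(local_entries, last_seen, event_log, fetch_entry):
    """Re-fetch every entry touched after last_seen; return the new marker.

    event_log is an assumed list of (timestamp, entry_id) pairs.
    fetch_entry(entry_id) is a hypothetical stand-in for a server call.
    """
    for timestamp, entry_id in sorted(event_log):
        if timestamp <= last_seen:
            continue  # already handled in an earlier run
        # An edit to an old entry shows up as a new event, so the
        # local copy gets overwritten with the current version.
        local_entries[entry_id] = fetch_entry(entry_id)
        last_seen = timestamp
    return last_seen


# Demo: entry 1 was downloaded earlier, then edited; entry 2 is new.
server = {1: "old post, now with tags", 2: "new post"}
local = {1: "old post"}
log = [(10, 1), (20, 2), (30, 1)]

marker = catch_up(local, last_seen=10, event_log=log, fetch_entry=server.get)
print(local, marker)
```

Because the marker only moves forward, a later run with no new events does nothing, and an edit to a years-old entry is picked up the same way as a brand-new one.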

Thank you again! I used to use LJdump but it didn't work properly with DW when I tried it a few years ago, so I'm so glad I found this, and much improved, no less. I was actually getting ready to manually save my pages as PDFs or something desperate like that. Can I buy you a virtual cup of coffee or something?

Huh... You know, I've never set up any kind of donationey platform for anything I've done. Tell ya what... Next time you find yourself on the West Coast, in the Berkeley area, I'll meet you and whatever folks you have with you, and you can buy me an iced mocha at the best iced mocha place around (The Baker And Commons Cafe).

Until then, don't worry about it. :)

Got it! I was in Berkeley (stayed at the Maida Center for a retreat) about 8 years ago and loved the Bay Area, so you never know when I might show up again.

Thank you so much for this! I tried many things but couldn't make them work. Your script is very easy to understand and I was able to make a backup of all my entries!

You're very welcome!

Hello,

Thank you for creating such an amazing tool! I wanted to back up my Dreamwidth journal locally, but I couldn't figure out the image cache setting.

It works when I double-click ljdump.py, but the images aren't cached. When I open the terminal and type in "./ljdump.py --cache_images", the terminal just opens and closes a window, and nothing else happens. I'm not really sure what I'm doing wrong...

Hello! It sounds like you're using the terminal on Windows... Unfortunately I haven't used that environment in years, so you may need to grab someone who can sit down at the keyboard with you and investigate more directly.

So I looked into it a little bit further, and decided to get an ljuniq cookie from Dreamwidth and paste it into my config file. Apparently that did the trick, and my images are cached now! Not all of them, but most of them at least.

Excellent! Glad it worked.
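For the curious, here is a rough sketch of why a cookie can matter when fetching images: some hosts only serve an image when the request carries the session cookie a logged-in browser would send. This is an illustration, not the script's actual code. The function name, the example URL, and the cookie value are all made up, and how ljdump itself uses the ljuniq value is an assumption here.

```python
import urllib.request


def build_image_request(image_url, ljuniq_value):
    """Build a request that sends an ljuniq cookie along with it.

    Both arguments are placeholders; nothing is sent over the network
    until the request is passed to urllib.request.urlopen().
    """
    headers = {"Cookie": "ljuniq=" + ljuniq_value}
    return urllib.request.Request(image_url, headers=headers)


# Hypothetical values, for demonstration only.
req = build_image_request("https://example.org/pic.jpg", "abc123")
print(req.get_header("Cookie"))
```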

Thank you so much for making this script! I have an error that came up that nobody else has mentioned in the thread, and I'm stumped, because while I know what this error MEANS, not so in this context:

Traceback (most recent call last):
  File "./ljdump.py", line 508, in <module>
    ljdump(
  File "./ljdump.py", line 91, in ljdump
    ljsession = getljsession(journal_server, username, password)
  File "./ljdump.py", line 55, in getljsession
    r = urllib.request.urlopen(journal_server+"/interface/flat", data=data)
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

I am using Linux Mint and new to Linux, so am totally willing to bet this is something totally obvious that I just don't know.

A possibility... For Dreamwidth, I put "https://dreamwidth.org" in the server part of the configuration file... Is that what you're using?

https://github.com/GBirkel/ljdump/blob/master/ljdump.config.sample

...omfg I think I put a www in there through force of habit. How fucking embarrassing.

I KNEW it must be something stupidly obvious! Will check when back on machine, thank you so much.
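Since this one is easy to trip over, here is a small sketch of a sanity check for the "server" value in the config file. It is not part of ljdump; the function name and the warning texts are made up. The "www" check reflects only what this thread suggests (a "www" prefix leading to a 404 on /interface/flat), not a documented rule.

```python
from urllib.parse import urlsplit


def check_server_setting(server):
    """Return a list of warnings about a 'server' config value."""
    warnings = []
    parts = urlsplit(server)
    if parts.scheme not in ("http", "https"):
        warnings.append("server should start with http:// or https://")
    # Assumption based on this thread: the 'www.' host answered the
    # /interface/flat request with a 404.
    if (parts.hostname or "").startswith("www."):
        warnings.append("server host starts with 'www.'; try removing it")
    return warnings


print(check_server_setting("https://dreamwidth.org"))
print(check_server_setting("https://www.dreamwidth.org"))
```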

I am back once again...

This error seems to occur if there are Unicode characters in the display name (in my case it was "♪"):

[Running it on Windows this time around.] I got around it by temporarily removing the character in the name, but I'm reporting anyway in case anybody else comes across this issue.

First: Thank you so much for doing this, the tool is amazing.

Second: Probably a dumb question, but: I messed around a little with the HTML of the index/table of contents and saved the stylesheet from my current Dreamwidth journal in the folder -- https://wembley.dreamwidth.org/res/4147489/stylesheet?1745745106 -- but I can't get the darker blue left-hand sidebar to appear with the list of tags and such. I'm not very tech-savvy; I just know a teeny bit of HTML and I don't really understand CSS. Do you know what code I should add to put the sidebar back?

Third, about the actual tool again: A really awesome person backed up my Livejournal with a different method and it worked really well, but I wanted to test out your version of LJDump just for fun and see how it worked on my old LJ, since it worked really well on my DW (though I had to do the "--max 1500" thing or it would get rate-limited). When I tried it on my Livejournal (wemblee.livejournal.com), this happened:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1286, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1332, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1281, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1041, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 979, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1458, in connect
    self.sock = self._context.wrap_socket(self.sock,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 517, in wrap_socket
    return self.sslsocket_class._create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1075, in _create
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1346, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1002)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Applications/Downloader apps/Livejournal downloaders/ljdump-1.7.9/ljdump.py", line 508, in <module>
    ljdump(
  File "/Applications/Downloader apps/Livejournal downloaders/ljdump-1.7.9/ljdump.py", line 91, in ljdump
    ljsession = getljsession(journal_server, username, password)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Applications/Downloader apps/Livejournal downloaders/ljdump-1.7.9/ljdump.py", line 55, in getljsession
    r = urllib.request.urlopen(journal_server+"/interface/flat", data=data)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1002)

Btw, this part:

urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1002)

had the "<" ">" brackets around it but when I kept those in this comment, DW didn't like it and said it was an unclosed HTML bracket, so just FYI.

Thank you so much for updating this program, it really is awesome of you. <3

Edited 2025-04-27 15:39 (UTC)
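A CERTIFICATE_VERIFY_FAILED error like the one above can have several causes, and the traceback alone does not say which one applies. One thing that can be checked locally is which trusted root certificates this Python installation can see. The sketch below uses only the standard ssl module; the function name is made up, and the output only describes the local trust store, it does not diagnose the server. Note that certificates found through a capath directory are loaded lazily, so the counts can read zero even when verification would work.

```python
import ssl


def describe_trust_store():
    """Collect where Python looks for CA certificates and what it loaded."""
    paths = ssl.get_default_verify_paths()
    context = ssl.create_default_context()
    return {
        "cafile": paths.cafile,    # bundle file, or None if not found
        "capath": paths.capath,    # certificate directory, or None
        "store_stats": context.cert_store_stats(),
    }


report = describe_trust_store()
for key, value in report.items():
    print(f"{key}: {value}")
```

If both cafile and capath come back as None and the store is empty, the Python install probably has no root certificates configured, which is worth ruling out before looking at the network or the server.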
