Updated Dreamwidth backup script
Mar. 24th, 2024 09:55 pm
For quite a while I've been looking for some nice way to get a complete backup of my Dreamwidth content onto my local machine. And I gotta wonder... Is this not a very popular thing? There are a lot of users on here, posting a lot of cool and unique content. Wouldn't they want to have a copy, just in case something goes terribly wrong?
I found a Python script that does a backup, and had been patched to work with Dreamwidth, but the backup took the form of a huge pile of XML files. Thousands of them. I wanted something more flexible, so I forked the script and added an optional flag that writes everything (entries, comments, userpic info) to a single SQLite database.
https://github.com/GBirkel/ljdump
Folks on MacOS can just grab the contents of the repo and run the script. All the supporting modules should already be present in the OS. Windows people will need to install some version of Python.
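Once everything is in a single SQLite file, you can poke at it with a few lines of Python. Here's a minimal sketch that lists each table and its row count without assuming anything about the schema. The function name summarize_backup is made up for illustration, and you'd pass it whatever database file the script produced for your journal.

```python
import sqlite3


def summarize_backup(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        # sqlite_master is SQLite's built-in catalog of tables and indexes.
        cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
        tables = [row[0] for row in cur.fetchall()]
        counts = {}
        for name in tables:
            # Table names can't be bound as parameters, so quote them instead.
            quoted = '"' + name.replace('"', '""') + '"'
            cur.execute("SELECT COUNT(*) FROM %s" % quoted)
            counts[name] = cur.fetchone()[0]
        return counts
    finally:
        conn.close()
```

Calling summarize_backup on your database path and printing the result is a quick way to confirm that entries and comments actually landed in the file. Note that sqlite3.connect creates an empty file if the path doesn't exist, so double-check the path first.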
For what it's worth, here's the old discussion forum for the first version of the script, released way back around 2009.
Update, 2024-03-25:
The script now also downloads and stores tag and mood information.
Update, 2024-03-26:
After synchronizing, the script now generates browseable HTML files of the journal, including entries for individual pages with comment threads, and linked history pages showing 20 entries at a time.
Moods, music, tags, and custom icons are shown for the entries where applicable.
Currently the script uses the stylesheet for my personal journal (this one), but you can drop in the styles for yours and it should accept them. The structure of the HTML is rendered as close as possible to what Dreamwidth makes.
Update, 2024-03-28:
The script can also attempt to store local copies of the images embedded in journal entries. It organizes them by month in an images folder next to all the HTML. This feature is enabled with a "--cache_images" argument.
Every time you run it, it will attempt to cache 200 more images, going from oldest to newest. It will skip over images it's already tried and failed to fetch, until 24 hours have gone by, then it will try those images once again.
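The retry rule described above can be sketched roughly like this. It illustrates the policy only and is not the script's actual code: the function names, the record keys, and the constants' names are all made up for the example.

```python
from datetime import datetime, timedelta

# Both values come from the description above; the names are illustrative.
MAX_IMAGES_PER_RUN = 200
RETRY_AFTER = timedelta(hours=24)


def should_attempt(cached, last_attempted, now):
    """Decide whether an image is worth fetching on this run.

    cached         -- True if a local copy already exists
    last_attempted -- datetime of the last failed fetch, or None if never tried
    now            -- current datetime
    """
    if cached:
        return False
    if last_attempted is None:
        return True
    # A previous attempt failed: wait out the retry window before trying again.
    return now - last_attempted >= RETRY_AFTER


def pick_images(records, now, limit=MAX_IMAGES_PER_RUN):
    """Pick up to `limit` image URLs to fetch, oldest first.

    records -- iterable of dicts with 'url', 'first_seen', 'cached', 'last_attempted'
    """
    ordered = sorted(records, key=lambda r: r["first_seen"])
    chosen = [r["url"] for r in ordered
              if should_attempt(r["cached"], r["last_attempted"], now)]
    return chosen[:limit]
```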
The image links in your entries are left unchanged in the database. They're swapped for local links only in the generated HTML pages.
Update, 2024-04-02:
The script is now ported to Python 3, and tested on both Windows and MacOS. I've added new setup instructions for both that are a little easier to follow.
Update, 2024-04-30:
Added an option to stop the script from trying to cache images that failed to cache once already.
2024-06-26: Version 1.7.6
Attempt to fix music field parsing for some entries.
Fix for crash on missing security properties for some entries.
Image fetch timeout reduced from 5 seconds to 4 seconds.
2024-08-14: Version 1.7.7
Slightly improves unicode handling in tags and the music field.
2024-09-07: Version 1.7.8
Changes "stop at fifty" command line flag to a "max n" argument, with a default of 400, and applies it to comments as well as entries. This may help people who have thousands of comments complete their initial download. I recommend using the default at least once, then using a value of 1500 afterward until you're caught up.
2024-09-18: Version 1.7.9
Table of contents for the table of contents!
First version of an "uncached images" report to help people find broken image links in their journal.
2025-09-25: Version 1.8
Added some more unicode guardrails for things like icon keywords and the journal name.
no subject
Date: 2024-03-25 08:47 am (UTC)
Thank you!
no subject
Date: 2024-03-26 04:50 am (UTC)
no subject
Date: 2024-03-27 08:08 am (UTC)
no subject
Date: 2024-03-27 08:24 am (UTC)
I'll have to, yes.
no subject
Date: 2024-05-10 06:40 pm (UTC)
no subject
Date: 2024-05-10 07:31 pm (UTC)
no subject
Date: 2024-05-10 07:59 pm (UTC)
<3
Date: 2024-05-11 10:51 pm (UTC)
line 823, in get_or_create_cached_image_record
date_or_none = date_first_seen.strftime('%s')
ValueError: Invalid format string
Got around it by adding
import time
with the other imports and changing line 823 (now 824) to
date_or_none = time.mktime(date_first_seen.timetuple())
(fix stolen from here, dunno if it's a good fix tho)
EDIT: I also ended up making some more changes to download images hosted on Dreamwidth, also in their original resolution - patch file below in case it's handy.
Edit again: fix running ljdumptohtml.py alone, allow images to have attributes between <img and src="
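For context on the error above: %s is not one of the format codes Python documents for strftime. It is a platform extension that works on Linux and macOS but raises "Invalid format string" on Windows. Here's a small sketch of two portable ways to get Unix seconds from a datetime. The helper names are made up, and which one matches the script's intent depends on whether its datetimes are meant as local time or UTC, which I'm not assuming either way.

```python
import calendar
import time
from datetime import datetime


def to_unix_local(dt):
    """Seconds since the epoch, treating a naive datetime as local time.

    This is the same call used in the workaround above.
    """
    return time.mktime(dt.timetuple())


def to_unix_utc(dt):
    """Seconds since the epoch, treating a naive datetime as UTC."""
    return calendar.timegm(dt.timetuple())
```

One thing to watch for: time.mktime returns a float, while strftime returned a string, so anything that later compares or stores the value may need to account for the type change.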
Patch file
--- C:/dw/ljdump-1.7.4/ChangeLog Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ChangeLog Sat May 11 00:00:00 2024
--- C:/dw/ljdump-1.7.4/ljdump.config.sample Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ljdump.config.sample Sat May 11 00:00:00 2024
@@ -3,4 +3,5 @@
     <server>https://livejournal.com</server>
     <username>myaccount</username>
     <password>mypassword</password>
+    <ljuniq>ljuniq cookie if you want to download Dreamwidth hosted images</ljuniq>
 </ljdump>
--- C:/dw/ljdump-1.7.4/ljdump.py Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ljdump.py Sat May 11 00:00:00 2024
@@ -70,7 +70,7 @@
     return e[0].firstChild.nodeValue
-def ljdump(journal_server, username, password, journal_short_name, verbose=True, stop_at_fifty=False, make_pages=False, cache_images=False, retry_images=True):
+def ljdump(journal_server, username, password, ljuniq, journal_short_name, verbose=True, stop_at_fifty=False, make_pages=False, cache_images=False, retry_images=True):
     m = re.search("(.*)/interface/xmlrpc", journal_server)
     if m:
@@ -417,6 +417,7 @@
 ljdumptohtml(
     username=username,
     journal_short_name=journal_short_name,
+    ljuniq=ljuniq,
     verbose=verbose,
     cache_images=cache_images,
     retry_images=retry_images
@@ -444,6 +445,11 @@
     password = password_els[0].childNodes[0].data
 else:
     password = getpass("Password: ")
+ljuniq_els = config.documentElement.getElementsByTagName("ljuniq")
+if len(ljuniq_els) > 0:
+    ljuniq = ljuniq_els[0].childNodes[0].data
+else:
+    ljuniq = getpass("ljuniq cookie (for Dreamwidth hosted image downloads, leave blank otherwise): ")
 journals = [e.childNodes[0].data for e in config.documentElement.getElementsByTagName("journal")]
 if not journals:
     journals = [username]
@@ -457,6 +463,7 @@
 print
 username = input("Username: ")
 password = getpass("Password: ")
+ljuniq = getpass("ljuniq cookie (for Dreamwidth hosted image downloads, leave blank otherwise): ")
 print
 print("You may back up either your own journal, or a community.")
 print("If you are a community maintainer, you can back up both entries and comments.")
@@ -474,6 +481,7 @@
 journal_server=journal_server,
 username=username,
 password=password,
+ljuniq=ljuniq,
 journal_short_name=journal,
 verbose=args.verbose,
 stop_at_fifty=args.fifty,
--- C:/dw/ljdump-1.7.4/ljdumpsqlite.py Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ljdumpsqlite.py Sat May 11 00:00:00 2024
@@ -30,6 +30,8 @@
 from sqlite3 import Error
 from xml.sax import saxutils
 from builtins import str
+import time
+import re
 # Subclass of tzinfo swiped mostly from dateutil
@@ -803,6 +805,10 @@
     SELECT id, url, filename, date_first_seen, date_last_attempted, cached
     FROM cached_images WHERE url = :url""", {'url': image_url})
 row = cur.fetchone()
+pattern = re.compile('^https://(\w+).dreamwidth.org/file/\d+x\d+/(.+)')
+if pattern.match(image_url):
+    result = pattern.search(image_url)
+    get_or_create_cached_image_record(cur, verbose, 'https://' + result.group(1) + '.dreamwidth.org/file/' + result.group(2), date_first_seen)
 if row:
     if verbose:
         print('Found image cache record for: %s' % (image_url))
@@ -820,7 +826,7 @@
     print('Creating image cache record for: %s' % (image_url))
 date_or_none = None
 if date_first_seen:
-    date_or_none = date_first_seen.strftime('%s')
+    date_or_none = time.mktime(date_first_seen.timetuple())
 data = {
     "id": None,
     "url": image_url,
--- C:/dw/ljdump-1.7.4/ljdumptohtml.py Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/ljdumptohtml.py Mon May 13 23:37:54 2024
@@ -385,12 +385,18 @@
 def resolve_cached_image_references(content, image_urls_to_filenames):
     # Find any image URLs
-    urls_found = re.findall(r'img[^\"\'()<>]*\ssrc\s?=\s?[\'\"](https?:/+[^\s\"\'()<>]+)[\'\"]', content, flags=re.IGNORECASE)
+    urls_found = re.findall(r'<img[^<>]*\ssrc\s?=\s?[\'\"](https?:/+[^\s\"\'()<>]+)[\'\"]', content, flags=re.IGNORECASE)
     # Find the set of URLs that have been resolved to local files
     resolved_urls = []
     for image_url in urls_found:
         if image_url in image_urls_to_filenames:
             resolved_urls.append(image_url)
+        pattern = re.compile('^https://(\w+).dreamwidth.org/file/\d+x\d+/(.+)')
+        if pattern.match(image_url):
+            result = pattern.search(image_url)
+            full_url = 'https://' + result.group(1) + '.dreamwidth.org/file/' + result.group(2)
+            if full_url in image_urls_to_filenames:
+                resolved_urls.append(full_url)
     # Swap them in
     for image_url in resolved_urls:
         filename = image_urls_to_filenames[image_url]
@@ -630,10 +636,15 @@
     return html_as_string
-def download_entry_image(img_url, journal_short_name, subfolder, url_id):
+def download_entry_image(img_url, journal_short_name, subfolder, url_id, entry_url, ljuniq):
     try:
-        image_req = urllib.request.urlopen(img_url, timeout = 5)
+        pattern = re.compile('^https://(\w+).dreamwidth.org/file/')
+        headers = {}
+        if pattern.match(img_url):
+            headers = {'Referer': entry_url, 'Cookie': "ljuniq="+ljuniq}
+        image_req = urllib.request.urlopen(urllib.request.Request(img_url, headers = headers), timeout = 5)
         if image_req.headers.get_content_maintype() != 'image':
+            print('Content type was not expected, image skipped: ', img_url, image_req.headers.get_content_maintype())
             return (1, None)
         extension = MimeExtensions.get(image_req.info()["Content-Type"], "")
@@ -681,7 +692,7 @@
         return (1, None)
-def ljdumptohtml(username, journal_short_name, verbose=True, cache_images=True, retry_images=True):
+def ljdumptohtml(username, journal_short_name, ljuniq, verbose=True, cache_images=True, retry_images=True):
     if verbose:
         print("Starting conversion for: %s" % journal_short_name)
@@ -741,8 +752,17 @@
 e_id = entry['itemid']
 entry_date = datetime.utcfromtimestamp(entry['eventtime_unix'])
 entry_body = entry['event']
-urls_found = re.findall(r'img[^\"\'()<>]*\ssrc\s?=\s?[\'\"](https?:/+[^\s\"\'()<>]+)[\'\"]', entry_body, flags=re.IGNORECASE)
+initial_urls_found = re.findall(r'<img[^<>]*\ssrc\s?=\s?[\'\"](https?:/+[^\s\"\'()<>]+)[\'\"]', entry_body, flags=re.IGNORECASE)
 subfolder = entry_date.strftime("%Y-%m")
+urls_found = []
+for image_url in initial_urls_found:
+    urls_found.append(image_url)
+    pattern = re.compile('^https://(\w+).dreamwidth.org/file/\d+x\d+/(.+)')
+    if pattern.match(image_url):
+        result = pattern.search(image_url)
+        full_url = 'https://' + result.group(1) + '.dreamwidth.org/file/' + result.group(2)
+        urls_found.append(full_url)
+
 for image_url in urls_found:
     cached_image = get_or_create_cached_image_record(cur, verbose, image_url, entry_date)
     try_cache = True
@@ -758,7 +778,7 @@
 image_id = cached_image['id']
 cache_result = 0
 img_filename = None
-(cache_result, img_filename) = download_entry_image(image_url, journal_short_name, subfolder, image_id)
+(cache_result, img_filename) = download_entry_image(image_url, journal_short_name, subfolder, image_id, entry['url'], ljuniq)
 if (cache_result == 0) and (img_filename is not None):
     report_image_as_cached(cur, verbose, image_id, img_filename, entry_date)
     image_resolve_max -= 1
@@ -955,6 +975,11 @@
     journals = [e.childNodes[0].data for e in config.documentElement.getElementsByTagName("journal")]
     if not journals:
         journals = [username]
+    ljuniq_els = config.documentElement.getElementsByTagName("ljuniq")
+    if len(ljuniq_els) > 0:
+        ljuniq = ljuniq_els[0].childNodes[0].data
+    else:
+        ljuniq = getpass("ljuniq cookie (for Dreamwidth hosted image downloads, leave blank otherwise): ")
 else:
     print("ljdumptohtml - livejournal (or Dreamwidth, etc) archive to html utility")
     print
@@ -976,6 +1001,7 @@
 ljdumptohtml(
     username=username,
     journal_short_name=journal,
+    ljuniq=ljuniq,
     verbose=args.verbose,
     cache_images=args.cache_images,
     retry_images=args.retry_images
--- C:/dw/ljdump-1.7.4/README.md Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/README.md Sat May 11 00:00:00 2024
--- C:/dw/ljdump-1.7.4/stylesheet.css Sat May 11 00:00:00 2024
+++ C:/dw/ljdump-1.7.4-patched/stylesheet.css Sat May 11 00:00:00 2024
Re: <3
Date: 2024-05-21 06:59 am (UTC)
https://github.com/GBirkel/ljdump/releases/tag/v1.7.5
Re: <3
Date: 2024-05-21 10:56 pm (UTC)
I get the ljuniq cookie by opening a new private browsing window, opening the network request section of the browser window, going to https://www.dreamwidth.org/ then solving the CAPTCHA. It redirects back to the homepage which returns a set-cookie HTTP header shown in the developer tools, the line looks something like
(I replaced parts of it with "x" in that since it's just an example.)
The part that looks like
ewp0jLxxxx97IQp%3A171xxxx687
is the important part, I copy that and paste it as the <ljuniq> value in the configuration file. Note the %3A in the cookie I think needs to be changed to a : (since it's "URL encoded").
So the end configuration option looks like
<ljuniq>ewp0jLxxxx97IQp:171xxxx687</ljuniq>
That cookie tends to expire really easily in my experience (or maybe it's my user error, I'm bad at doing things right). I'm not sure what the conditions are, but it sometimes takes me a few tries to get it to let me download the images. If it fails, it'll say "Content type text not expected, image skipped" for the image that didn't work. I've resorted to commenting out the logic at the end of the "Respect the global image cache setting" bit in ljdumptohtml.py and then running only ljdumptohtml.py to retry the images that failed, without waiting 24h or deleting the cache rows from the database manually.
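If you'd rather not edit the %3A by hand, Python's standard library can do the URL decoding. This is just a sketch using the "x"-ed out example value from above. Whether the server actually needs the decoded form is the guess made in this comment, not something confirmed here.

```python
from urllib.parse import unquote

# Example value with parts replaced by "x", taken from the comment above.
raw_cookie = "ewp0jLxxxx97IQp%3A171xxxx687"

# unquote() turns percent-escapes such as %3A back into literal characters.
decoded = unquote(raw_cookie)
print(decoded)
```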
Unrelated if it helps, found another crash but I'm not sure how to fix it:
The lines
print('Adding new event %s at %s: %s' % (data['itemid'], data['eventtime'], data['subject']))
and
print('Updating event %s at %s: %s' % (data['itemid'], data['eventtime'], data['subject']))
can crash with errors like
UnicodeEncodeError: 'cp932' codec can't encode character '\xe2' in position 54: illegal multibyte sequence
I don't know how to fix that, but removing the printing of data['subject'] works around it.
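One way to keep the subject in the output without crashing is to replace any characters the console can't encode before printing. This is a sketch of that idea, not the fix the script ended up using. The helper name printable is made up for the example.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding can't represent replaced by '?'."""
    # Fall back to the console's encoding, then to UTF-8 if that is unavailable.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)
```

The print calls would then wrap the risky field, for example printable(data['subject']). On Python 3.7 and later, calling sys.stdout.reconfigure(errors="replace") once at startup is another option that covers every print in the program.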
One possible note: since it uses the original image rather than also downloading the thumbnail, the image shown in the rendered HTML page can be very large and run off the web page if the original picture is high resolution but had a small thumbnail in the journal entry. Since that is different from the patch, I'm guessing that may be a deliberate decision though.
If I read it right, I think this line sends the cookie unconditionally, even for non-Dreamwidth images. Just to note, I don't think that follows the cookie security that a browser would have, where the cookie is private information shared only with the original host. It may be a security consideration to leak it to other hosts, though I don't know how DW uses that cookie.
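To illustrate the cookie concern, here's a sketch of building the request headers so the cookie only goes to Dreamwidth hosts. The header names match the ones in the patch earlier in this thread. The function name and the exact host check are assumptions for the example, not the script's code.

```python
from urllib.parse import urlparse


def headers_for_image(img_url, entry_url, ljuniq):
    """Build request headers, attaching the cookie only for Dreamwidth hosts."""
    host = (urlparse(img_url).hostname or "").lower()
    # Match dreamwidth.org itself or any subdomain, but not lookalike domains.
    is_dreamwidth = host == "dreamwidth.org" or host.endswith(".dreamwidth.org")
    if is_dreamwidth and ljuniq:
        return {"Referer": entry_url, "Cookie": "ljuniq=" + ljuniq}
    return {}
```

Checking the parsed hostname rather than pattern-matching the whole URL avoids sending the cookie to something like evildreamwidth.org.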
Thank you again for the script, and sorry to mention a lot of things!
Re: <3
Date: 2024-05-22 12:47 am (UTC)
Yeah I think you're right about the encoding error... Something in the subject line. I currently treat the entry content carefully with respect to encoding conversion because it can be from all kinds of origins, but other short data fields like "music" and "subject" are handled a little more simply, which needs to change...
One thing that makes it especially difficult is Dreamwidth does some of its own internal conversion when rendering a journal, so diagnosing the problem isn't as simple as going to the entry on the Dreamwidth site and having a look...
What's the subject line that causes the crash? I know pasting it in here will probably convert it into unicode, but maybe it will give me a clue...
Extracting that cookie information is a complicated process. :D Instead of re-authenticating with a Captcha, what happens when you use the ljuniq cookie that's stored in the browser when you're already authenticated? Like, right now I can go to Developer Window->Storage->Cookies in Safari and see an ljuniq cookie... Could I just paste that into the script?
Sorry I can't test this myself... I don't have any images hosted on DW...
KeyError: 'protected'
Date: 2024-06-01 12:24 pm (UTC)
I've just started attempting to back up my old LJ but have run into this error, and I'm not sure what has caused it. Maybe it's because I have Private posts? It's been a long time since I've used LJ so I've probably forgotten what privacy categories exist. Here's the error, it appears after adding 'moods':
Adding new mood with name: jealous
Adding new mood with name: nervous
Traceback (most recent call last):
File "C:\Users\Anthony\Downloads\ljdump-1.7.5\ljdump.py", line 488, in <module>
ljdump(
File "C:\Users\Anthony\Downloads\ljdump-1.7.5\ljdump.py", line 348, in ljdump
'security_protected': t['security']['protected'],
KeyError: 'protected'
Re: <3
Date: 2024-06-17 12:18 am (UTC)
The crashing entry title was "I haven’t been on reddit in ages, I got so much from it but honestlly was ready to move on ig". It didn't crash on linux, only on windows. I suspect it's the unicode quotation mark in "haven’t" breaking it, but haven't confirmed that.
Yes, when I used the very latest ljuniq cookie value from the last request in the developer tools from my normal logged in window, it seemed to work, so I wonder if it's just that I was grabbing ones that were too old before.
Thanks again for the useful tool and your replies!
Re: KeyError: 'protected'
Date: 2024-06-23 08:15 am (UTC)
So it may not be about the entry being protected, it may be about the data coming from the server just being weirdly incomplete. Maybe some old entry before a schema change, or some weird side-effect of an import.
Thanks for the info! I'll see if I can cook up a patch for this later today.
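For data that can arrive incomplete like this, one defensive pattern is to read nested keys with a fallback instead of indexing directly. This is a sketch only: the helper name and the default of 0 are assumptions, and the actual fix in the script may look different.

```python
def get_nested(record, outer, inner, default=0):
    """Look up record[outer][inner], returning `default` if either key is missing."""
    section = record.get(outer) or {}
    return section.get(inner, default)
```

With that, the failing line from the traceback could read get_nested(t, 'security', 'protected') and would yield the default rather than raising KeyError.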
Re: KeyError: 'protected'
Date: 2024-06-24 02:01 am (UTC)
no subject
Date: 2024-06-27 02:04 am (UTC)
https://github.com/GBirkel/ljdump/releases
Re: KeyError: 'protected'
Date: 2024-06-27 02:04 am (UTC)
https://github.com/GBirkel/ljdump/releases
no subject
Date: 2024-06-27 05:10 pm (UTC)
What it seems to have exchanged in return is to render those glyphs with diacritical marks in mojibake instead of their correct glyphs, and it may or may not be rendering HTML escape codes correctly or well, but it is definitely not erroring out on those posts that it was erroring before. The solution you implemented is at least functional.
no subject
Date: 2024-06-28 05:47 pm (UTC)
This is the last bit of output:
Traceback (most recent call last):
File "./ljdump.py", line 501, in <module>
ljdump(
File "./ljdump.py", line 430, in ljdump
ljdumptohtml(
File "/Users/brandie 1/Downloads/ljdump-1.7.6/ljdumptohtml.py", line 765, in ljdumptohtml
cached_image = get_or_create_cached_image_record(cur, verbose, url_to_cache, entry_date)
File "/Users/brandie 1/Downloads/ljdump-1.7.6/ljdumpsqlite.py", line 832, in get_or_create_cached_image_record
cur.execute("""
sqlite3.OperationalError: near "RETURNING": syntax error
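A note on this error: the RETURNING clause was added to SQLite in version 3.35.0, so an older SQLite library underneath Python will reject it with this kind of syntax error. Here's a sketch of checking for that and falling back to lastrowid. The one-column insert is a simplified stand-in for the script's real statement, and the function name is made up.

```python
import sqlite3

# RETURNING is available from SQLite 3.35.0 onward.
HAS_RETURNING = sqlite3.sqlite_version_info >= (3, 35, 0)


def insert_image_url(cur, url):
    """Insert a row and return its id, with or without RETURNING support."""
    if HAS_RETURNING:
        cur.execute("INSERT INTO cached_images (url) VALUES (?) RETURNING id", (url,))
        return cur.fetchone()[0]
    cur.execute("INSERT INTO cached_images (url) VALUES (?)", (url,))
    # lastrowid is the rowid of the most recent successful INSERT on this cursor.
    return cur.lastrowid
```

Printing sqlite3.sqlite_version from the same Python you run the script with shows which SQLite version is in use.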
no subject
Date: 2024-06-28 05:54 pm (UTC)
no subject
Date: 2024-06-28 06:22 pm (UTC)
Re: KeyError: 'protected'
Date: 2024-06-29 06:07 pm (UTC)
Error getting item: L-1018
<Fault 404: "Client error: Cannot post: You've exceeded a posting limit and will be able to continue posting within an hour.">
Fetching journal entry L-1019 (create)
Error getting item: L-1019
<Fault 404: "Client error: Cannot post: You've exceeded a posting limit and will be able to continue posting within an hour.">
Fetching journal entry L-1020 (create)
Error getting item: L-1020
<Fault 404: "Client error: Cannot post: You've exceeded a posting limit and will be able to continue posting within an hour.">
Re: KeyError: 'protected'
Date: 2024-06-29 07:45 pm (UTC)
no subject
Date: 2024-06-29 07:49 pm (UTC)
Re: KeyError: 'protected'
Date: 2024-06-30 12:14 pm (UTC)
Is there a way to turn on logs for you to look at further?