Cleanup on Aisle Three - Bamboo Cyberdream

This weekend I found some strange files floating around my Dreamhost directories. I was hacked! My web page is pretty simple and non-dynamic, but I have a bunch of other sites that I host for friends and other side projects, and the current guess is that the hackers got in through an outdated version of Wordpress or the super old installation of Mediawiki that was hanging around.

(Worth noting that the folks at Dreamhost gave some top-notch customer service, helping me figure out where the hack came from, what it had done, and cleaning up the damage. Yeah, they’re a budget host, but I’ve always been pretty impressed by what they let you do and how they help you out when you need it.)

On top of updating all of the old web apps and setting them to automatically update in the future, I started looking into making the whole shebang less dynamic so it would be less of a target for hackers.

It was a fun little project to do the conversion, and I’ve detailed it below the cut for anyone who’s interested. Either way, the upside is that this whole thing has me more excited about writing and blogging in general, so hopefully I can ride this enthusiasm to up my posting schedule beyond the meager 2-3 posts/year that it’s become.

So, yeah. The blog has moved away from Wordpress and is now using Octopress. I would in no way recommend Octopress for most people, but it’s fantastic if you’re comfortable with code and version control. The site you’re seeing right now is not dynamic; it’s just a bunch of static HTML files that get generated every time I make a new post. No database, no formatting engine, and no comments. But it should load faster, be easily portable to other hosts, etc.

Converting Posts

The first step was getting my posts out of my Wordpress database and into the flat-file system that Octopress uses. The built-in conversion scripts that Octopress inherits from its underlying Jekyll engine (lord, talking about software projects quickly makes you sound ridiculous) worked pretty well as a first pass at this. When it builds the site, Octopress just parses the files in a certain directory and turns them into posts. Something that’s not very well documented is that it uses the file’s extension to determine formatting. I haven’t pushed very far, but it looks like it recognizes .html, .markdown, and .textile. The Wordpress importer made a bunch of HTML files (since Wordpress is authored in HTML by default), but I had been using Textile for my blog, so it was just a matter of changing all their file extensions with a quick bit of shell scripting.
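
For anyone who would rather not do the rename in a shell, here is a rough Python 3 equivalent. It is only a sketch, not the command I actually ran: the function name, the default extensions, and the demo filenames are all made up for illustration, and the demo works in a throwaway temp directory so nothing real gets renamed.

```python
import os
import tempfile


def change_extensions(directory, old_ext=".html", new_ext=".textile"):
    """Rename every file in `directory` ending in old_ext to end in new_ext."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        if ext != old_ext:
            continue  # leave files with any other extension alone
        new_name = base + new_ext
        os.rename(os.path.join(directory, name),
                  os.path.join(directory, new_name))
        renamed.append(new_name)
    return renamed


# Demo on a temporary directory with hypothetical post filenames.
demo_dir = tempfile.mkdtemp()
for name in ("2009-01-02-hello-world.html",
             "2010-03-04-second-post.html",
             "notes.markdown"):
    open(os.path.join(demo_dir, name), "w").close()

renamed_files = change_extensions(demo_dir)
print(renamed_files)
```

Pointing the function at the real posts directory instead of the temp one would do the whole conversion in one go.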

(I actually wasted a TON of time playing with automatic Textile->Markdown tools before I discovered that Jekyll handled Textile natively. My preferences in lightweight markup languages have shifted over the years, so I’ll probably be using Markdown from here on out. Thankfully, another advantage of the flat-file is that it’s easy to mix-and-match formatting styles as I go. Heck, maybe I’ll use raw HTML just for kicks once.)

Comments

Right, so the new blog wasn’t going to have comments, but I wanted to preserve the old comments people had made. I looked into a lot of options for this, but in the end just ended up writing a little Python script that pulled them out of the database and applied some formatting to them. I then had to manually append them to their appropriate posts; I probably could have figured out some clever way of cross-referencing the Wordpress database with the Octopress files, but my blog is pretty damned small and it wasn’t too hard to figure out which comments went where.

The script is super ugly (just meant as a quick one-off hack to get the job done) but if anyone else might find it useful, here it is:

Formatting Wordpress comments straight from the database

import sys
import os
import hashlib
import time
import textile
import MySQLdb


comments = {}

con = None

try:
    # Connection details (host, user, password, database) are redacted.
    con = MySQLdb.connect('***', '***', '***', '***')
    cur = con.cursor(MySQLdb.cursors.DictCursor)
    cur.execute("SELECT * FROM wp_comments")

    # Group the comment rows by the ID of the post they belong to.
    for row in cur.fetchall():
        comments.setdefault(row["comment_post_ID"], []).append(row)
finally:
    # Close the connection even if the query fails; errors still propagate.
    if con:
        con.close()

# Comments are ordered by their GMT timestamp. The database hands back
# datetime objects, which compare directly, so a sort key is all we need.
def comment_sort_key(comm):
    return comm["comment_date_gmt"]


outer_html = """<div class="comment-section-header">
<span class="comment-header">%i archived comment%s</span> <span class="no-comments"><a href="/no-comments">Why no more comments?</a></span></div>
<ol id="comments" class="commentlist">
%s</ol>
"""
pingback_outer_html = """<div class="comment-section-header">
<span class="comment-header">%i archived pingback%s</span></div>
<ol id="pingbacks" class="pingbacklist">
%s</ol>
"""
comment_html = """    <li class="comment">
        <span class="comment-author vcard"><img src='http://www.gravatar.com/avatar/%s?s=64' class="photo avatar" height="64" width="64"/> <span class="commenter-name">%s</span> wrote:</span>
        %s
        <span class="comment-meta">Posted <abbr class="comment-published" title="%s">%s</abbr> </span>
    </li>
"""
pingback_html = """    <li class="pingback">
        <span class="pingback-meta vcard">From <span class="url"><a href='%s' rel='external nofollow'>%s</a></span> on <abbr class="comment-published" title="%s">%s</abbr></span>%s
    </li>
"""

for post in comments:
    post_comments = sorted(comments[post], key=comment_sort_key)

    pingback_html_list = []
    comment_html_list = []

    for comment in post_comments:
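        # Gravatar identifies an avatar by the MD5 hex digest of the trimmed,
        # lowercased email address; hashlib wants bytes, hence the encode().
        email = comment["comment_author_email"].strip().lower()
        mailhash = hashlib.md5(email.encode("utf-8")).hexdigest()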

        raw_gmt = comment["comment_date_gmt"]
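        # comment_date_gmt is already in GMT, so mark it as UTC ("Z") rather
        # than stamping it with a -0500 offset.
        timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", raw_gmt.timetuple())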
        raw_local = comment["comment_date"]
        pretty_date = time.strftime("%B %-d, %Y at %-I:%M%p", raw_local.timetuple())

        if (comment["comment_type"] == "pingback"):
            pingback_html_list.append(pingback_html % (comment["comment_author_url"], comment["comment_author"], timestamp, pretty_date, comment["comment_content"]))
        else:
            if (len(comment["comment_author_url"]) > 0):
                author = '<a href="%s" rel="external nofollow">%s</a>' % (comment["comment_author_url"], comment["comment_author"])
            else:
                author = comment["comment_author"]
            comment_html_list.append(comment_html % (mailhash, author, textile.textile(comment["comment_content"]), timestamp, pretty_date))

    exported_html = '<hr/>\n'

    if (len(comment_html_list) > 0):
        exported_comment_list = ""
        for comment in comment_html_list:
            exported_comment_list += comment
        exported_html += outer_html % (len(comment_html_list), "s" if len(comment_html_list) > 1 else "", exported_comment_list)

    if (len(pingback_html_list) > 0):
        exported_pingback_list = ""
        for pingback in pingback_html_list:
            exported_pingback_list += pingback
        exported_html += pingback_outer_html % (len(pingback_html_list), "s" if len(pingback_html_list) > 1 else "", exported_pingback_list)

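    # Write one HTML fragment per post ID, creating the output directory if needed.
    if not os.path.isdir("comments"):
        os.makedirs("comments")
    with open("comments/%s.html" % post, "w") as exp:
        exp.write(exported_html)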

Sidebar

Oddly enough, Octopress doesn’t have great support for posting a bunch of links in the sidebar. I took Balaji Sivaraman’s blogroll plugin and modified it. Previously, it looked at a directory where each YAML file represented a different link. I really didn’t want to make a new file every time I wanted to add a link to the sidebar, so in my version, there’s a _sidebar directory in which each file represents a section of the sidebar.

sidebar directory layout

_sidebar
  |- 1.projects_and_miscellany.yml
  |- 2.nifty_places.yml
  |- 3.cool_people.yml

The number at the front of the file name is parsed out to determine the ordering in the sidebar, and the rest of the filename gets titlecased to become the section label. Inside each file are multiple YAML documents, one for each link. Now adding a new link just means editing an existing file instead of creating a new one.
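
To make the naming convention concrete, here is a small Python 3 sketch of the filename parsing. The actual plugin is Ruby, so this is purely illustrative: the function name and regex are my own stand-ins, and `str.title()` capitalizes every word, whereas a proper titlecase routine would probably leave small words like "and" lowercase.

```python
import re


def parse_section_filename(filename):
    """Split a name like '2.nifty_places.yml' into (order, label)."""
    match = re.match(r"^(\d+)\.(.+)\.yml$", filename)
    if match is None:
        raise ValueError("unexpected sidebar filename: %r" % filename)
    order = int(match.group(1))
    # Underscores become spaces, then each word is capitalized.
    label = match.group(2).replace("_", " ").title()
    return order, label


filenames = ["3.cool_people.yml",
             "1.projects_and_miscellany.yml",
             "2.nifty_places.yml"]

# Sorting the (order, label) tuples puts the sections in sidebar order.
sections = sorted(parse_section_filename(name) for name in filenames)
for order, label in sections:
    print(order, label)
```

Putting the number in the filename means reordering the sidebar is just a rename.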

I also stripped out a bunch of dynamic JavaScript polling of RSS feeds and sorting based on most recent post, since it wasn’t really necessary for what I was doing.

My Octopress Sidebar plugin is now up on a GitHub repo.

Git

I had to finally sit down and get comfortable with Git. Octopress is pretty much entirely based around using Git to manage the blog, to deploy, etc. Git’s main competition (though I don’t think that’s the right word in this context… the other software playing in their space?) is Mercurial, and I’ve been using that a lot recently in developing Angel. It’s weird – the concepts are very similar, but Git comes at it from such a different angle. It’s been slow going so far as my brain adjusts to the Git way of doing things.

Mercurial still makes more intuitive sense to me, and I imagine it will be that way for a while, but GitHub’s popularity makes it hard to not at least try to get comfortable using it. So this proved a good motivator to buckle in and get to it.

Dropbox

Finally, I don’t do it often, but I wanted to be able to write a blog post on the go when I had to. Usually when I travel these days, I don’t bring my laptop with me. For most of what I need on the road, my iPad serves me just fine, has way better battery life, is lighter, is harder to break, and doesn’t have to be taken out of my bag at airport security.

But, as is much lamented by the technoscenti, it’s hard to do development things like source control and arbitrary scripting on the iPad, and thus far nobody’s coded up an app that streamlines Octopress deployment from iOS.

Some would give up. I saw it as a challenge.

Luckily for me, so did Dennis Wegner.

He had a pretty clever setup, which included monitoring a Dropbox folder that mapped into his blog’s repository, with different directories for Drafts, Queued Posts, and the normal Published ones. He also had to do some trickery because it had to run through his home Mac mini. Rather than trusting some home machine to always be on, I figured I’d try and leverage an existing server.

I didn’t want to tempt fate at Dreamhost, though, by setting up the Linux Dropbox daemon to run forever on my account there. (Shared hosting; I don’t want to be a bad neighbor.) Thankfully, a friend of mine from college has a server set up that pretty much just serves as the technical playground for our old nerd crew. (This server is called “the bear.”) I’d do all my hosting there if I didn’t love Dreamhost’s panel and One-Click installs so much, but for something that doesn’t need a whole lot of space, it’s great.

SO – my Dropbox account is now syncing on the bear, which also has a copy of the blog’s Git repository, symlinking its relevant directories into the Dropbox. These files can change in one of two ways:

1. They get updated from Dropbox, meaning I edited them on my iPad (or some other machine where it was easier to get to Dropbox than syncing Git). In this case, the watcher script (more on that in a second) notices the change, automatically commits the change to Git, regenerates the site, and deploys the static files to hosting.
2. I push to the Git repository on the server from my laptop, which I’ll likely only do after running a deployment from it. In this case, the server repository automatically does an update on its working directory, thus updating the Dropbox files.

This does require a slight bit of maintenance on my part to remember that I either need to pull from the server (if I’ve updated the site via Dropbox magic) or to push to the server (if I’ve updated locally). That’s pretty similar to my usual workflow of pecking away at a project from multiple computers with source control as my go-between, so it’s not too bad.

Making the Dropbox watcher script ended up being kind of fun. Once again, I modified someone else’s work, in this case adding, I think, a tiny bit more magic than was there previously. The original version of the script checked the _drafts folder for any file with a published attribute of true, then renamed it appropriately (so its filename matched the date) and moved it into the _queue folder. Then it would check the _queue folder for any posts with a published date set in the past, and move them into the _posts folder. If anything was set to be published, it would rebuild and deploy the site.
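
Boiled down, the original script's logic is a tiny state machine. Here is a hedged Python 3 sketch of it: the function name and the dates are invented for illustration, and it models a single step per post, whereas the real script walks the folders in sequence, so a draft can land in the queue and get published in the same run.

```python
from datetime import datetime


def next_folder(folder, published, post_date, now):
    """Return the folder a post should move to, or its current one."""
    if folder == "_drafts" and published:
        return "_queue"   # finished draft: wait in the queue
    if folder == "_queue" and post_date < now:
        return "_posts"   # publish date has passed: go live
    return folder         # otherwise leave it where it is


# Arbitrary example moment; any datetime works here.
now = datetime(2013, 6, 1, 12, 0)

print(next_folder("_drafts", True, datetime(2013, 6, 2), now))
print(next_folder("_drafts", False, datetime(2013, 6, 2), now))
print(next_folder("_queue", True, datetime(2013, 5, 31), now))
print(next_folder("_queue", True, datetime(2013, 6, 2), now))
```

A site rebuild is only needed when something actually arrives in _posts.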

Pretty snazzy, but I added some more folds to it.

1. If it comes across a .txt file in the _drafts folder, it renames it to the Octopress scheme using today’s date and the first few words of the file as a temporary title. It also sticks some boilerplate YAML frontmatter (including published: false) in so that its attributes can be parsed by the rest of the script, in addition to giving it a .markdown extension.
2. It also checks the _posts folder for anything with a published: false attribute and moves it back to the _drafts folder.
3. If it changes or moves any files, it does so through Git and automatically commits to the server repository.

The first point lets me use Drafts or Nebulous Notes to make a quick start for a post and then just drop it into the _drafts directory on Dropbox without having to worry about the YAML frontmatter, the correct filename scheme, etc. Later, once I revise it and give the post a proper title, the system will be renaming it anyway.

The second point lets me effectively withdraw a post after it’s been published, in case I made some screwup or just didn’t mean it to go out yet.

The third addition just ensures that everything stays in sync across Dropbox, the server, and my home machine (through push/pulls).

(I also made sure that it would only parse the metadata (for example date: *) in the designated area at the front of the file, as I found when trying to deal with this very post that matching that regex multiple times could spell strange doom for the script.)
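
Here is a minimal Python 3 sketch of that "only look at the front of the file" trick, mirroring the split-on-`---` approach. The function name is my own, the date in the sample post is made up, and this is a naive line splitter rather than a real YAML parser.

```python
def front_matter_fields(text):
    """Return key/value pairs found between the first two '---' markers."""
    parts = text.split("---", 2)
    if len(parts) < 3:
        return {}  # no complete frontmatter block
    fields = {}
    for line in parts[1].splitlines():
        # Split on the first colon only, so times like 09:30 stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Sample post whose body also contains a line that looks like metadata.
post = """---
layout: post
title: Cleanup on Aisle Three
date: 2013-06-01 09:30
published: true
---
Body text that happens to mention
date: 1999-12-31 23:59
"""

fields = front_matter_fields(post)
print(fields["date"])
```

Because the body is never scanned, a post that talks about its own metadata can't confuse the script.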

The watcher script runs as a cron job every 5 minutes.

A modified watcher script that periodically does blog maintenance on designated directories

#!/usr/bin/env ruby


# Based on Dennis Wegner's it_queue.rb (https://github.com/derdennis/it_queue)

require 'rubygems'
require 'fileutils'
require 'time'

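# Resolve the plugin relative to this file rather than the current directory,
# so the script also works when cron starts it from somewhere else.
require_relative '../plugins/titlecase'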

empty_yaml = <<EOC
---
layout: post
title: %s
tags:
- Bamboo Cyberdream
status: draft
type: post
date: %s
published: false
meta: {}
---
EOC

# Determine directory in which we live
@script_dir = File.expand_path(File.dirname(__FILE__))

# This gets set to 1 if we find a post to publish
build = 0

# This is the deploy command for octopress
deploy_cmd = "rake gen_deploy"
# This is the dir where octopress lives
build_folder = File.expand_path(File.join(@script_dir, ".."))

# Folders where the content lives
drafts_folder = File.join(build_folder, "source", "_drafts")
queue_folder = File.join(build_folder, "source", "_queue")
posts_folder = File.join(build_folder, "source", "_posts")
# Maybe check for the upload of referenced images?
images_folder = File.join(build_folder, "source", "images", "blog_images")

# Let's go to the drafts_folder!
Dir.chdir(drafts_folder)

puts " "
puts "Now checking for finished drafts..."
puts " "
puts "=" * 50
# Check for drafts in _drafts which were set to "published:true"
Dir.entries(Dir.pwd).each do |post|
    next if post.start_with?(".")

    # Tell me at which post you are looking
    print "Checking post ", post, "\n"

    if post.end_with?(".txt")
        print "Prepping text file ", post, "\n"
        first_line = File.open(post, &:readline)
        first_line.gsub!("#", "")
        bits = first_line.split()
        file_title = bits[0..5].join('-').downcase # max of 6 words
        time_string = "#{Time.now.strftime('%Y-%m-%d %H:%M')}"
        date_string = "#{Time.now.strftime('%Y-%m-%d')}"
        text_title = bits.join(' ').titlecase

        yaml_out = empty_yaml % [text_title, time_string]
        filename = "#{date_string}-#{file_title}.markdown"
        filename.gsub!(/^.*(\\|\/)/, '')
        filename.gsub!(/[^0-9A-Za-z.\-]/, '_')

        file_text = File.open(post, &:read)
        file_text = yaml_out + file_text

        FileUtils.remove_file(post)
        File.open(filename, "w") { |f| f.write(file_text) }

        %x{git add #{filename}}
        %x{git commit -m "Watcher script processing new file."}

        post = filename
    end

    # Get the post date from inside the post
    File.open( post ) do |f|
        frontmatter = f.read().split("---")[1]
        frontmatter.lines.grep( /^date: / ) do |line|
            @post_date = line.gsub(/date: /, "").gsub(/\s.*$/, "")
            break
        end
    end

    # Get the post title from the currently processed post
    @post_title = post.to_s.gsub(/\d{4}-\d{2}-\d{2}/, "")

    # Build the new filename
    @new_post_name = @post_date + @post_title

    File.open( post ) do |f|
        frontmatter = f.read().split("---")[1]
        frontmatter.lines.grep( /^published: true/ ) do |line|
            # Move this post to the _queue folder
            print "Moving post ", post, " to queue folder, ", "renaming to ", @new_post_name, "\n"
            # FileUtils.mv(post, queue_folder + '/' + @new_post_name)
            %x{git mv #{post} #{queue_folder}/#{@new_post_name}}
            %x{git add #{queue_folder}/#{@new_post_name}}
            %x{git commit -m "Watcher script moving draft to queue folder."}

            puts "=" * 50
            break
        end
    end
end

# Let's go to the posts_folder!
Dir.chdir(posts_folder)
puts " "
puts "Checking to see if there are any published posts which need to get pulled back..."
puts " "
puts "=" * 50
# Check to see if any of these have been set to "published: false."
Dir.entries(Dir.pwd).each do |post|
    next if post.start_with?(".")

    print "Checking post ", post, "\n"
    File.open( post ) do |f|
        frontmatter = f.read().split("---")[1]
        frontmatter.lines.grep( /^published: false/ ) do |line|
            # Move these back to the _drafts-folder
            print "Moving retracted post ", post, " back to drafts folder.\n"
            # FileUtils.mv(post, drafts_folder + '/' + post)
            %x{git mv #{post} #{drafts_folder}/#{post}}
            %x{git commit -m "Watcher script moving retracted post to drafts folder."}
            build = 1

            puts "=" * 50
            break
        end
    end
end

# Let's go to the queue_folder!
Dir.chdir(queue_folder)
puts " "
puts "Now checking for queued posts, which are ready for publishing..."
puts " "
puts "=" * 50
# Check for the "date: " part of the posts inside of queue. 
Dir.entries(Dir.pwd).each do |post|
    next if post.start_with?(".")

    print "Checking post ", post, "\n"
    File.open( post ) do |f|
        frontmatter = f.read().split("---")[1]
        frontmatter.lines.grep( /^date: / ) do |line|
            # Show me the filename and the matching line
            # Build a Time-object out of the date string
            post_date = Time.parse(line.gsub(/date: /, "").gsub(/\n$/, ""))
            now_date = Time.now
            print "Post date: ", post_date.inspect, "\n"
            print "Now date: ", now_date.inspect, "\n"

            if post_date < now_date
                puts "Post date is in the past. Publish!"
                #puts "Moving post to posts folder..."
                # FileUtils.mv post, posts_folder
                %x{git mv #{post} #{posts_folder}}
                %x{git add #{posts_folder}/#{post}}
                %x{git commit -m "Watcher script publishing queued post."}
                # Set build variable to 1
                build = 1
            else
                puts "Post date is in the future. Do nothing."
            end
            puts "=" * 50
            break
        end
    end
end

# Do a deployment if we need to.
if build == 1
    puts "We build the site!"
    # Run rake from the Octopress root, where the Rakefile lives.
    Dir.chdir(build_folder)
    output = %x{#{deploy_cmd}}
    puts output
end

Whew!

Now I’ve got a fancy new blog, and I had some fun doing it. (Yes, this is my idea of fun. I’m a professional nerd, what do you want? No, I won’t fix your computer. Did you reboot it yet?)