If Wishes Were Yaks

Last night, I published a big post about Yak Barber and how it had finally reached a point that was functional enough to post about. However, after I posted it, I kept fiddling with it anyway. My milestone of ‘well, the Atom feed is broken’ wasn’t good enough for me. I ended up staying up several more hours doing the thing I said I would wait to do. Even after that, I couldn’t fall asleep immediately, because I was thinking of all the spaghetti code I want to yank out (yak out?).

Self-Critique

Importing the settings from the settings.py file relies on the imp module, so that a command-line argument can specify a different location. If I just hardcoded a path, it would work, but you’d always be stuck with the same path. Python will gladly import the settings.py file this way. It’ll also gladly complain that this isn’t really how importing is supposed to work. So kudos for whining.
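
The loading amounts to something like this (a minimal sketch; the actual flag name and default path in Yak Barber may differ):

import imp
import argparse

# hypothetical flag name; the real argument may be spelled differently
parser = argparse.ArgumentParser()
parser.add_argument('--settings', default='./settings.py',
                    help='path to a settings.py file')
args = parser.parse_args()

# imp.load_source imports the file at an arbitrary path as a module
settings = imp.load_source('settings', args.settings)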

The settings that get read in are then piped into local variables, in the event that I need to override something locally with future command-line arguments. The script will also throw a warning if any of these required variables are missing (a sketch of that check follows the block below).

root = settings.root
webRoot = settings.webRoot
contentDir = settings.contentDir
templateDir = settings.templateDir
outputDir = settings.outputDir
sitename = settings.sitename
author = settings.author
md = settings.md
postsPerPage = settings.postsPerPage
typekitId = settings.typekitId
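
The missing-variable warning could be as simple as this (a sketch; the real check may be structured differently):

import warnings

required = ['root', 'webRoot', 'contentDir', 'templateDir', 'outputDir',
            'sitename', 'author', 'md', 'postsPerPage', 'typekitId']
for name in required:
  if not hasattr(settings, name):
    warnings.warn('settings.py is missing a required variable: ' + name)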

Then, because of the way Python scripts are structured, the main function is called. It runs some checks to make sure the needed directories exist, creating them if they don’t, and in turn calls start(), which handles the broad strokes of processing the pages and moving files by delegating out to other functions. Nothing bad here, really.
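
The directory check is roughly this (a sketch using the settings variables from above):

import os

# create any of the configured directories that don't exist yet
for d in (contentDir, templateDir, outputDir):
  if not os.path.isdir(d):
    os.makedirs(d)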

Then, unfortunately, you get to the part that renders the pages:

def renderPost(post, posts):
  # post is a (metadata dict, rendered HTML) pair from the markdown step
  metadata = {}
  for k, v in post[0].iteritems():
    metadata[k] = v[0]  # metadata values arrive wrapped in lists, so take the first
  metadata[u'content'] = post[1]
  metadata[u'sitename'] = sitename
  metadata[u'webRoot'] = webRoot
  metadata[u'author'] = author
  metadata[u'typekitId'] = typekitId
  # build the output filename from the date and the de-punctuated title
  postName = removePunctuation(metadata[u'title'])
  postName = metadata[u'date'].split(' ')[0] + '-' + postName.replace(u' ',u'-')
  postFileName = outputDir + postName + '.html'
  metadata[u'postURL'] = webRoot + postName + '.html'
  metadata[u'title'] = unicode(smartypants.smartypants(metadata[u'title']))
  # open here is codecs.open, which takes the encoding as its third argument
  with open(templateDir + u'/post-content.html','r','utf-8') as f:
    postContentTemplate = f.read()
    postContent = pystache.render(postContentTemplate,metadata,decode_errors='ignore')
    metadata['post-content'] = postContent
  with open(templateDir + u'/post-page.html','r','utf-8') as f:
    postPageTemplate = f.read()
    postPageResult = pystache.render(postPageTemplate,metadata,decode_errors='ignore')
  with open(postFileName,'w','utf-8') as f:
    f.write(postPageResult)
  return posts.append(metadata)

The markdown module generates a Python dictionary from the MMD-style metadata. That’s just the crap at the top of the file with colons. The first empty line, or the first line without a colon, separates the metadata from the text of your document. I’m overriding several values in the dictionary with the variables we’ve pulled from settings.py.
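
For reference, here’s roughly how that step behaves, assuming Python-Markdown’s meta extension (the post content here is made up):

import markdown

source = u"""Title: Example Post
Date: 2014-05-21 15:00:00
Category: text

The body starts after the first blank line."""

md = markdown.Markdown(extensions=['meta'])
html = md.convert(source)
# keys are lowercased and every value comes back as a list:
print md.Meta  # {u'title': [u'Example Post'], u'date': [u'2014-05-21 15:00:00'], u'category': [u'text']}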

If you think that’s not so bad, just wait until you see the part that makes the index pages and the Atom feed! It’s like opening a wall in your house and finding black mold. Oh boy!

def feed(posts):
  feedDict = posts[0]  # reuses the first post's dict as the whole feed's context
  entryList = str()
  feedDict['gen-time'] = datetime.datetime.utcnow().isoformat('T') + 'Z'
  with open(templateDir + u'/atom.xml','r','utf-8') as f:
    atomTemplate = f.read()
  with open(templateDir + u'/atom-entry.xml','r','utf-8') as f:
    atomEntryTemplate = f.read()
  for e,p in enumerate(posts):
    # rewrites each post's date and content in place, clobbering the originals
    p[u'date'] = RFC3339Convert(p[u'date'])
    p[u'content'] = extractTags(p[u'content'],'script')
    p[u'content'] = extractTags(p[u'content'],'object')
    p[u'content'] = extractTags(p[u'content'],'iframe')
    if e < 100:  # cap the feed at the 100 newest entries
      atomEntryResult = pystache.render(atomEntryTemplate,p)
      entryList += atomEntryResult
  feedDict['atom-entry'] = entryList
  feedResult = pystache.render(atomTemplate,feedDict,string_encode='utf-8')
  with open(outputDir + 'feed','w','utf-8') as f:
    f.write(feedResult)

def paginatedIndex(posts):
  # sort by date, newest first, and build the feed from the same list
  indexList = sorted(posts,key=lambda k: k[u'date'])[::-1]
  feed(indexList)
  postList = []
  for i in indexList:
    postList.append(i['post-content'])  # collected but never actually used
  indexOfPosts = splitEvery(postsPerPage,indexList)
  with open(templateDir + u'/index.html','r','utf-8') as f:
    indexTemplate = f.read()
  indexDict = {}
  indexDict[u'sitename'] = sitename
  indexDict[u'typekitId'] = typekitId
  # walk the page-sized chunks, special-casing the first page and the links
  for e,p in enumerate(indexOfPosts):
    indexDict['post-content'] = p
    print e  # debug output
    for x in p:
      print x['title']
    if e == 0:
      fileName = u'index.html'
      if len(indexList) > postsPerPage:
        indexDict[u'previous'] = webRoot + u'index2.html'
    else:
      fileName = u'index' + str(e+1) + u'.html'
      if e == 1:
        indexDict[u'next'] = webRoot + u'index.html'
        indexDict[u'previous'] = webRoot + u'index' + str(e+2) + u'.html'
      else:
        indexDict[u'previous'] = webRoot + u'index' + str(e+2) + u'.html'
        if e < len(indexList):
          indexDict[u'next'] = webRoot + u'index' + str(e-1) + u'.html'
    indexPageResult = pystache.render(indexTemplate,indexDict)
    with open(outputDir + fileName,'w','utf-8') as f:
      f.write(indexPageResult)

You’ll notice a lot of repeated keys and values from the renderPost() function. In fact, feed() even starts with the output the previous function generated. The way the Python version of Mustache handles lists in a dictionary resulted in an Atom XML file with Python list syntax wedged between all the XML tags. That wasn’t cool. Everything had to be looped over separately, in just the right way, to make a dictionary pystache wouldn’t barf all over. Because the feed needs things handled in a special way, and because the index pagination needs things handled in another special way, I have to jump through these hoops to make these separate, looped dictionaries just to keep the “simple” template engine happy. That is a lot of code duplication spent setting dictionary keys.
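
If memory serves, the problem is that plain Mustache interpolation just stringifies whatever it’s given, so a list value comes out as its Python repr (a minimal sketch of the behavior):

import pystache

context = {u'entries': [u'<entry/>', u'<entry/>']}
# triple braces avoid HTML escaping; the list is still coerced to a string
print pystache.render(u'<feed>{{{entries}}}</feed>', context)
# prints: <feed>[u'<entry/>', u'<entry/>']</feed>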

You’ll also notice the feed has date-string shenanigans in order to meet Atom’s UTC timestamp requirement, laid out in the tantalizingly named RFC 3339 specification. It turns out Python doesn’t really have a single module that does this. So I get to use time, datetime, and pytz to parse the date and time, convert it between two different representations, stick an origin timezone on it, and feed it to a third-party library that actually understands what timezones are. To say that this is a deficiency of Python would be an understatement.

def RFC3339Convert(timeString):
  strip = time.strptime(timeString, '%Y-%m-%d %H:%M:%S')
  dt = datetime.datetime.fromtimestamp(time.mktime(strip))
  pacific = pytz.timezone('US/Pacific')
  # pytz zones must be attached with localize(); replace(tzinfo=...) silently
  # picks a historical offset and shifts the result by a few minutes
  ndt = pacific.localize(dt)
  utc = pytz.utc
  return ndt.astimezone(utc).isoformat().split('+')[0] + 'Z'

But don’t worry, that’s not the only thing the W3C Feed Validation Service wanted to yell at me about! It is also considered “bad” to have <script>, <object>, and <iframe> tags in your XML, even inside the part that’s just HTML. This means things like YouTube embeds and Twitter’s embedded tweets need to be sanitized. It is certainly, without a doubt, not even worthwhile to use embedded tweets in the future.

After spelunking through StackOverflow all night for time problems, I got to go back and look for the best way to remove tags. It is not considered wise to use regex on XML/HTML to remove tags. Fine, eggheads, what do you recommend? Enter BeautifulSoup.

def extractTags(html,tag):
  # parse the fragment and rip out every instance of the given tag
  soup = BeautifulSoup.BeautifulSoup(html)
  to_extract = soup.findAll(tag)
  for item in to_extract:
    item.extract()
  return unicode(soup)

This explains all those lines in feed() where the content key kept getting scribbled over. That isn’t the right way to do it. I should feed a list of tags to extractTags() and write the result to a new key on the dictionary. If I did that, the sanitized version could coexist with the original content value, and I could make it part of renderPost().
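
Something like this, maybe (a sketch; the ‘feed-content’ key is a made-up name):

def extractTags(html, tags):
  # BeautifulSoup's findAll accepts a list of tag names, so one pass suffices
  soup = BeautifulSoup.BeautifulSoup(html)
  for item in soup.findAll(tags):
    item.extract()
  return unicode(soup)

# then, inside renderPost():
# metadata[u'feed-content'] = extractTags(postContent, ['script', 'object', 'iframe'])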

Basically, there should be one, singular dictionary or class that carries all the information about a post, and one, singular structure that carries all of the posts together. Then all the functions should yank data from that one tree of data instead of generating what is, essentially, identical stuff.
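
One possible shape for that refactor (every name here is hypothetical):

class Post(object):
  def __init__(self, metadata, content):
    self.metadata = metadata      # title, date, category, and friends
    self.content = content       # the rendered HTML body
    self.feed_content = None     # sanitized copy for the Atom feed

allPosts = []  # the one, singular tree of data every function reads from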

In terms of benchmarks:


1939885 function calls (1925524 primitive calls) in 2.171 seconds
1939758 function calls (1925400 primitive calls) in 3.014 seconds
1939758 function calls (1925400 primitive calls) in 2.663 seconds

Processing 122 discrete text files in 2-3 seconds doesn’t strike me as especially bad. The biggest offender is, surprise, Python’s regex substitution: the teeny-tiny part of the code that removes characters from the titles to generate filenames and URLs. Oh, regex! (arms akimbo)
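
For reference, removePunctuation() amounts to something along these lines (a rough sketch; the actual pattern in Yak Barber differs):

import re

def removePunctuation(text):
  # strip anything that isn't a word character, whitespace, or a hyphen
  return re.sub(u'[^\w\s-]', u'', text, flags=re.UNICODE)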

I should have just gone to bed.

2014-05-21 15:00:00

Category: text