Ok, so I wanted a local copy of *The Metal Earth* in order to better prepare for my game. Based on previous work I had done, this proved to be fairly easy, and I improved my scripts along the way. Yay!
*The Metal Earth* is a blogspot blog with full page content in the Atom feed.
To identify the blog, look at the source of any page. The HTML header will contain a line like the following: `<link rel="service.post" type="application/atom+xml" title="..." href="http://www.blogger.com/feeds/XXX/posts/default" />` – this is where you get the number from. In this case, the number is 2248254789731612355.
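If you don't want to hunt through the page source by hand, the number can be pulled out with a one-liner. A minimal sketch, run here against a sample header line instead of a live page fetched with curl; the sed pattern is the part that matters:

```shell
# Extract the blog ID from the service.post link in a Blogger page.
# The sample line stands in for the real page source.
line='<link rel="service.post" type="application/atom+xml" href="http://www.blogger.com/feeds/2248254789731612355/posts/default" />'
printf '%s\n' "$line" | sed -n 's!.*blogger\.com/feeds/\([0-9]*\)/posts.*!\1!p'
```

Against a live page you would pipe `curl -s` of any blog page into the same sed command.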
*download.sh* – this file downloads the atom feed files:
#! /bin/sh
for i in `seq 40`; do
start=$((($i-1)*25+1))
curl -o foo-$i.atom "http://www.blogger.com/feeds/2248254789731612355/posts/default?start-index=$start&max-results=25"
done
You’ll find that you only need to keep the first four of them.
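How do you know which ones to keep? One way is to list the files that actually contain `<entry>` elements; the rest are the nearly empty tail. A sketch using two fake feed files in a temporary directory:

```shell
# List only the feed files that contain at least one entry.
# grep -L instead of -l would show the empty ones to delete.
cd "$(mktemp -d)"
printf '<feed><entry><title>one</title></entry></feed>' > foo-1.atom
printf '<feed/>' > foo-9.atom
grep -l '<entry' foo-*.atom
```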
*extract.sh* – this file calls the Perl script for every Atom file. You can use the -f option to force it to overwrite existing files.
#! /bin/sh
for f in *.atom; do
perl extract.pl "$@" < "$f"
done
*extract.pl* – this file has several CPAN dependencies. It will parse the Atom file, look at each entry, and write it into a separate file. If the entry doesn’t have a title, it will parse the HTML content and try to guess a title (looking at the first H1 or the first SPAN element). It will warn you about duplicate names. It will also try to set the last modification time of the file to the update timestamp in the Atom file.
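The timestamp step is easy to verify from the shell. A sketch assuming GNU coreutils, where `touch -d` accepts the W3CDTF timestamps Blogger emits and `stat -c %Y` reads the modification time back as epoch seconds:

```shell
# Set a file's mtime to a feed timestamp, then read it back.
# The timestamp here is made up for illustration.
f=$(mktemp)
touch -d '2012-05-27T16:20:00Z' "$f"
stat -c %Y "$f"
```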
#!/usr/bin/perl
use strict;
use XML::LibXML;
use HTML::HTML5::Parser;
use Getopt::Std;
use DateTime::Format::W3CDTF;
use DateTime;
our $opt_f;
getopts('f');
undef $/;
my $data = <STDIN>;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($data);
die $@ if $@;
my $encoding = $doc->actualEncoding();
my $context = XML::LibXML::XPathContext->new($doc);
$context->registerNs('atom', 'http://www.w3.org/2005/Atom');
my $html_parser;
foreach my $entry ($context->findnodes('//atom:entry')) {
my $content = $entry->getChildrenByTagName('content')->[0]->to_literal;
my $title = $entry->getChildrenByTagName('title')->[0]->to_literal;
$title =~ s!/!_!gi;
$title =~ s!&amp;!&!gi;
$title =~ s!&#(\d+);!chr($1)!ge;
if (not $title) {
if (not $html_parser) {
$html_parser = HTML::HTML5::Parser->new;
}
my $html_doc = $html_parser->parse_string($content);
# we don't know the HTML namespace for certain
my $html_ns = $html_doc->documentElement->namespaceURI();
my $html_context = XML::LibXML::XPathContext->new($html_doc);
$html_context->registerNs('html', $html_ns);
$title = $html_context->findnodes('//html:h1')->[0];
$title = $html_context->findnodes('//html:span')->[0] unless $title;
$title = $title->to_literal if $title;
warn "Guessed missing title: $title\n";
}
my $f = DateTime::Format::W3CDTF->new;
my $dt = $f->parse_datetime($entry->getChildrenByTagName('updated')->[0]->to_literal)->epoch;
my $file = $title . ".html";
if (-f $file and ! $opt_f) {
warn "$file exists\n";
} else {
open(F, ">:encoding($encoding)", $file) or die $! . ' ' . $file;
print F <<EOT;
<html>
<head>
<meta content='text/html; charset=$encoding' http-equiv='Content-Type'/>
</head>
<body>
$content
</body>
</html>
EOT
close F;
utime $dt, $dt, $file;
}
}
#Blogs
⁂
Recently I wanted a copy of *Elfmaids & Octopi* because the owner announced on Reddit that they were going to move elsewhere.
The directory structure I used:
┬ Elfmaids & Octopi
├ feed
└ html
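Setting that up is a single mkdir call; a sketch, quoting the name because of the ampersand and the spaces:

```shell
# Create the top folder with its two subdirectories in one go.
cd "$(mktemp -d)"
mkdir -p 'Elfmaids & Octopi/feed' 'Elfmaids & Octopi/html'
ls 'Elfmaids & Octopi'
```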
This is how I got a copy of the feed, download.sh in the top folder:
#! /bin/sh
for i in `seq 80`; do
start=$((($i-1)*25+1))
curl -o foo-$i.atom "https://www.blogger.com/feeds/737809845612070971/posts/default?start-index=$start&max-results=25"
done
This downloads a bit more than seventy Atom files with content, plus a few nearly empty ones. I moved them all into the first subdirectory:
mv foo-*.atom feed
I installed the missing dependency for my Perl script. Depending on your setup you might have more dependencies missing, and you might have to use cpan instead of my favourite, cpanm:
cpanm HTML::HTML5::Parser
I saved the Perl script as extract.pl:
#!/usr/bin/perl
use strict;
use XML::LibXML;
use HTML::HTML5::Parser;
use Getopt::Std;
use DateTime::Format::W3CDTF;
use DateTime;
our $opt_f;
getopts('f');
undef $/;
my $data = <STDIN>;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($data);
die $@ if $@;
my $encoding = $doc->actualEncoding();
my $context = XML::LibXML::XPathContext->new($doc);
$context->registerNs('atom', 'http://www.w3.org/2005/Atom');
my $html_parser;
foreach my $entry ($context->findnodes('//atom:entry')) {
my $content = $entry->getChildrenByTagName('content')->[0]->to_literal;
my $title = $entry->getChildrenByTagName('title')->[0]->to_literal;
$title =~ s!/!_!gi;
$title =~ s!&amp;!&!gi;
$title =~ s!&#(\d+);!chr($1)!ge;
if (not $title) {
if (not $html_parser) {
$html_parser = HTML::HTML5::Parser->new;
}
my $html_doc = $html_parser->parse_string($content);
# we don't know the HTML namespace for certain
my $html_ns = $html_doc->documentElement->namespaceURI();
my $html_context = XML::LibXML::XPathContext->new($html_doc);
$html_context->registerNs('html', $html_ns);
$title = $html_context->findnodes('//html:h1')->[0];
$title = $html_context->findnodes('//html:span')->[0] unless $title;
$title = $title->to_literal if $title;
warn "Guessed missing title: $title\n";
}
my $f = DateTime::Format::W3CDTF->new;
my $dt = $f->parse_datetime($entry->getChildrenByTagName('updated')->[0]->to_literal)->epoch;
my $file = "html/$title.html";
if (-f $file and ! $opt_f) {
warn "$file exists\n";
my $i = 2;
$i++ while -f "html/$title ($i).html";
$file = "html/$title ($i).html";
}
open(F, ">:encoding($encoding)", $file) or die $! . ' ' . $file;
print F <<EOT;
<html>
<head>
<meta content='text/html; charset=$encoding' http-equiv='Content-Type'/>
</head>
<body>
$content
</body>
</html>
EOT
close F;
utime $dt, $dt, $file;
}
And I saved a simple wrapper as extract.sh:
#! /bin/sh
for f in feed/*.atom; do
perl extract.pl "$@" < "$f"
done
And finally, since this version of the script writes its output to html/ directly, all the HTML files ended up in the second subdirectory.
Done!
– Alex