Chris Hager
Programming, Technology & More

Find broken hyperlinks in a PDF document with PDFx

PDFx is a free command-line tool to extract references, links and metadata from PDF files. You can also use it to find broken links in a PDF file, using pdfx -c:

PDFx Link Checker

For each URL and PDF reference, pdfx performs a HEAD request and checks the status code. If there are broken links, PDFx prints each link together with the page number where it was found in the original PDF:

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -c
Document infos:
- CreationDate = D:20150821110623-04'00'
...

Summary of link checker:
33 working
1 broken (reason: 303)
  - http://www.nytimes.com/interactive/2013/11/23/us/politics/23nsa-sigint-strategy-document.html (page 13)
1 broken (reason: 403)
  - http://www.nytimes.com/interactive/2013/11/23/us/ (page 13)
1 broken (reason: 404)
  - https://github.com/bumptech/stud/blob/ (page 13)
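The check described above can be sketched in a few lines of Python. This is only an illustration and not PDFx's actual code: the names `head_status` and `summarize` are made up for this example, and it assumes that every status other than 200 counts as broken (which matches the 303 entry in the output above).

```python
import urllib.error
import urllib.request
from collections import defaultdict


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so that 3xx codes are reported as-is."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def head_status(url, timeout=10):
    """Send a HEAD request and return the status code, or an error string."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError) as err:
        return str(err)


def summarize(links, fetch=head_status):
    """Group (url, page) pairs into working links and broken links by reason."""
    working, broken = [], defaultdict(list)
    for url, page in links:
        status = fetch(url)
        if status == 200:  # assumption: only a plain 200 counts as working
            working.append((url, page))
        else:
            broken[status].append((url, page))
    return working, dict(broken)
```

The `fetch` parameter only exists so that the grouping logic can be tried without network access, for instance by passing the `get` method of a dictionary that maps URLs to status codes.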

Installing PDFx

You can simply install PDFx with easy_install or pip and run it like this:

$ sudo easy_install -U pdfx   # or: pip install -U pdfx
...
$ pdfx <pdf-file-or-url>

Run pdfx -h to see the help output:

$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]
            [--version]
            pdf

Extract metadata and references from a PDF, and optionally download all
referenced PDFs. Visit https://www.metachris.com/pdfx for more information.

positional arguments:
  pdf                   Filename or URL of a PDF file

optional arguments:
  -h, --help            show this help message and exit
  -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        Download all referenced PDFs into specified directory
  -c, --check-links     Check for broken links
  -j, --json            Output infos as JSON (instead of plain text)
  -v, --verbose         Print all references (instead of only PDFs)
  -t, --text            Only extract text (no metadata or references)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output to specified file instead of console
  --version             show program's version number and exit
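If you want to call pdfx from a script, the flags from the help output can be combined as in the following sketch. The helper names `pdfx_command` and `pdfx_json` are made up for this example, and it assumes that `pdfx -j` writes nothing but JSON to standard output. The structure of that JSON is not shown in this post, so the sketch does not rely on any particular keys.

```python
import json
import subprocess


def pdfx_command(pdf, check_links=False, as_json=False, verbose=False, output_file=None):
    """Build the argument list for a pdfx call from the documented flags."""
    command = ["pdfx"]
    if check_links:
        command.append("-c")
    if as_json:
        command.append("-j")
    if verbose:
        command.append("-v")
    if output_file is not None:
        command += ["-o", str(output_file)]
    command.append(pdf)  # the positional argument: filename or URL of a PDF
    return command


def pdfx_json(pdf, check_links=False):
    """Run pdfx with -j and parse its standard output (assumed to be JSON)."""
    command = pdfx_command(pdf, check_links=check_links, as_json=True)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing the arguments as a list (instead of one shell string) avoids quoting problems with URLs that contain characters such as `&` or `?`.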

For more examples and information, take a look at the PDFx project page. You can find the code on GitHub; it is released under the Apache license.


Feedback, ideas and pull requests are welcome! You can also reach me on Twitter via @metachris.


Free Transactional Email Services - The Best Alternatives to Mandrill & Co.

Mandrill, the beloved-by-many transactional email service, recently announced that it will switch to a paid-only model under the MailChimp umbrella. This came as a surprise to many developers who used them for sending emails for free from various servers and backends. This post provides a round-up of some of the more popular Mandrill alternatives.

Free Transactional Email Providers

This section covers a number of popular transactional email providers with a free tier.

Mailgun

  • http://www.mailgun.com/pricing
  • First 10,000 emails per month free, then starting at $5 per 5k mails
  • Used by: GitHub, Stripe, Heroku
  • Sending via SMTP, HTTP API
  • Features: Delivery statistics, Recipient Variables (first name, order value, …), Open, click & unsubscribe tracking, Dedicated IP, Inbound Email

Amazon SES (Simple Email Service)

  • http://aws.amazon.com/ses/pricing
  • 62k emails per month free, then $0.10 per 1,000. $0.12 per GB of attachments sent
  • Sending via SMTP, HTTP API
  • Features: Analytics for last 24 hours only (delivery attempts, bounces, complaints, rejects), Only Basic Interface, Inbound Email

SparkPost

Mailjet

  • https://www.mailjet.com/pricing_v3
  • Free for 6k emails / month, paid plans starting at $7.50 for 30k mails
  • Sending via SMTP, HTTP API
  • Features: Tracking, Analytics, Personalization, Templates, Dedicated IP, Inbound Email

Elastic Email

  • http://elasticemail.com/pricing
  • First 25k emails per month free, then $0.19 per thousand
  • Sending via SMTP, HTTP API
  • Features: Email Open and Click Tracking, Templates, Dedicated IP, Inbound Email API

SendinBlue

  • https://www.sendinblue.com/pricing
  • 9k emails per month free (limited to 300 per day)
  • Sending via SMTP, HTTP API
  • Features: Delivery statistics, Templates, Template Editor, SMS Marketing, Dedicated IP

smtp.google.com

Postage App

  • https://secure.postageapp.com/register
  • 1k emails per month free (max 100/day), $9 for 10k emails/month, $29 for 40k emails/month, $79 for 100k emails/month
  • Sending via SMTP, HTTP API
  • Features: Analytics, Templates, Personalization, Dedicated IP

Paid-Only Transactional Email Providers

This incomplete section covers a number of paid-only transactional email providers.

Mandrill

  • http://www.mandrill.com/pricing
  • by Mailchimp, paid starting March 16
  • Starting at $20 for 25k emails, then about $0.80 per 1k
  • Features: Fine grained API, Open Source Libraries, Dedicated IP, Templates, Inbound Email

Postmark App

  • https://postmarkapp.com/pricing
  • $1.50 per 1k emails
  • First 25k emails free
  • Sending via SMTP, HTTP API
  • Features: Analytics, Open tracking, Dedicated IP, Templates, Inbound Email, Full content for every mail of past 45 days

SMTP.com

Sendgrid

  • https://sendgrid.com/pricing
  • Free plan for 12k emails / month. Paid plans starting at $10 for 40k emails, or $0.10 for 1k emails
  • 30 day trial with 40k emails
  • Sending via SMTP, HTTP API
  • Features: Analytics (open, click, bounce, etc.), Templates, Mobile apps, Fine grained API permissions, Dedicated IP, Inbound Email

Summary

Finally a quick overview in tabular form:

Service        Free Tier        20k Emails  50k Emails  100k Emails
Mailgun        10k/m            $5          $20         $45
Amazon SES     62k/m            $0          $0          ~$4
SparkPost      100k/m           $0          $0          $0
MailJet        6k/m             $7.49       $21.95      $74.95
Elastic Email  25k/m            $0          $4.75       $14.25
Sendin Blue    300/day, 9k/m    $7.37       $39         $66
Gmail SMTP     2k/day, 60k/m    $0          $0          -
Postage App    1k/month         $9          $79         $79
Mandrill       0                $20         $40         $80
Postmark       0                $30         $75         $150
SMTP.com       0                $70         $70         $160
SendGrid       0                $9.95       $19.95      $19.95
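For providers with pure pay-as-you-go pricing, the figures in the summary can be reproduced with a little arithmetic: only the emails beyond the free tier are billed. The function below is a rough sketch with a made-up name (`monthly_cost`). It does not cover plan-based pricing (fixed packages such as "$9 for 10k emails") or extras like the attachment fee of Amazon SES.

```python
from decimal import Decimal


def monthly_cost(emails, free_emails, price_per_1k):
    """Cost in dollars when only emails beyond the free tier are billed."""
    billable = max(0, emails - free_emails)
    cost = Decimal(billable) * Decimal(price_per_1k) / 1000
    return cost.quantize(Decimal("0.01"))


# Elastic Email: first 25k free, then $0.19 per thousand
print(monthly_cost(50_000, 25_000, "0.19"))    # 4.75
print(monthly_cost(100_000, 25_000, "0.19"))   # 14.25
# Amazon SES: 62k free, then $0.10 per 1,000 (attachments not included)
print(monthly_cost(100_000, 62_000, "0.10"))   # 3.80
```

Prices are passed as strings and handled as `Decimal` values so that the results are exact to the cent.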

You can contact me on Twitter and discuss this post on Hacker News.

Update 2017-01-11: Updated SendGrid prices (no longer has a free tier)
Update 2016-03-22: Updated Mailchimp prices, Added dedicated IPs to Postmark


How to Optimize Wordpress Performance with nginx and WP Super Cache

This is a simple and effective way to serve Wordpress pages blazingly fast: produce static HTML files with WP Super Cache, and serve them directly with nginx.

  • WP Super Cache (on Github) is an immensely popular, official Wordpress caching plugin with more than 1 million active installations. Basically the plugin produces static html pages of your posts and pages, and anonymous users can directly load the html without any interaction with PHP.

  • nginx is a very fast, flexible webserver, reverse proxy, load balancer and cache. If you are using Apache, WP Super Cache explains mod_rewrite rules to achieve the same thing as explained in this post.

After installing the WP Super Cache plugin, enable caching in the plugin settings, and configure the garbage collector timeout according to your needs.

Once enabled, visits by anonymous users (users who are not logged in) will create static .html files in the /wp-content/cache/supercache/ directory.


To achieve optimal performance, serve those .html files directly, bypassing PHP altogether. The following nginx config snippet works well for http:// and https:// sites. I save this snippet as wp-supercache.conf and reference it from the various server configs.

set $cache_uri $request_uri;

# POST requests and urls with a query string should always go to PHP
if ($request_method = POST) {
    set $cache_uri 'null cache';
}
if ($query_string != "") {
    set $cache_uri 'null cache';
}

# Don't cache uris containing the following segments
# (keep the pattern on one line: whitespace inside the quotes becomes part of the regex)
if ($request_uri ~* "(/wp-admin/|/xmlrpc.php|/wp-(app|cron|login|register|mail).php|wp-.*.php|/feed/|index.php|wp-comments-popup.php|wp-links-opml.php|wp-locations.php|sitemap(_index)?.xml|[a-z0-9_-]+-sitemap([0-9]+)?.xml)") {

    set $cache_uri 'null cache';
}

# Don't use the cache for logged-in users or recent commenters
if ($http_cookie ~* "comment_author|wordpress_[a-f0-9]+|wp-postpass|wordpress_logged_in") {
    set $cache_uri 'null cache';
}

# Set the cache file
set $cachefile "/wp-content/cache/supercache/$http_host/$cache_uri/index.html";
if ($https ~* "on") {
    set $cachefile "/wp-content/cache/supercache/$http_host/$cache_uri/index-https.html";
}

# Add cache file debug info as header
#add_header X-Cache-File $cachefile;

# Try in the following order: (1) cachefile, (2) normal url, (3) php
location / {
    try_files $cachefile $uri $uri/ /index.php;
}

This config snippet is based on the nginx.com article 9 Tips for Improving WordPress Performance, with added support for https.
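To make the decision logic of the snippet easier to follow, here is the same logic as a small Python function. This is purely illustrative: the function name `cache_file` and its parameters are made up, and where the nginx config sets $cache_uri to 'null cache' (a path that never exists, so try_files moves on to PHP), the sketch simply returns None.

```python
import re

# Same patterns as in the nginx snippet, matched case-insensitively like ~*
SKIP_URI = re.compile(
    r"(/wp-admin/|/xmlrpc.php|/wp-(app|cron|login|register|mail).php"
    r"|wp-.*.php|/feed/|index.php|wp-comments-popup.php"
    r"|wp-links-opml.php|wp-locations.php|sitemap(_index)?.xml"
    r"|[a-z0-9_-]+-sitemap([0-9]+)?.xml)",
    re.IGNORECASE,
)
SKIP_COOKIE = re.compile(
    r"comment_author|wordpress_[a-f0-9]+|wp-postpass|wordpress_logged_in",
    re.IGNORECASE,
)


def cache_file(host, request_uri, method="GET", query_string="", cookie="", https=False):
    """Return the cache file nginx would try first, or None if PHP handles the request."""
    # POST requests and urls with a query string always go to PHP
    if method == "POST" or query_string != "":
        return None
    # Excluded uris, logged-in users and recent commenters bypass the cache
    if SKIP_URI.search(request_uri) or SKIP_COOKIE.search(cookie):
        return None
    name = "index-https.html" if https else "index.html"
    return f"/wp-content/cache/supercache/{host}/{request_uri}/{name}"


print(cache_file("example.org", "/hello-world/"))
# /wp-content/cache/supercache/example.org//hello-world//index.html
```

The doubled slashes come from joining the path exactly like the nginx `set` directive does; the filesystem treats them like single slashes.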


Now just include the above snippet from an nginx server config:

server {
    listen 80;
    listen 443 ssl http2;
    server_name www.foremka.at;

    client_max_body_size 24m;
    gzip on;

    root /var/www/foremka.at/htdocs;
    index index.php;

    include snippets/wp-supercache.conf;  # <-- here we reference the wp-supercache snippet

    location ~ \.php$ {
        try_files $uri $uri/ /index.php?$args;
        include fastcgi.conf;
        fastcgi_pass unix:/var/run/php5-fpm.sock;
    }

    # Caching of media: images, icons, video, audio, HTC
    location ~* \.(?:jpg|jpeg|gif|png|ico|cur|gz|svg|svgz|mp4|ogg|ogv|webm|htc|woff|woff2)$ {
        expires 2M;
        add_header Cache-Control "public";
    }

    # CSS and Javascript
    location ~* \.(?:css|js)$ {
        expires 1d;
        add_header Cache-Control "public";
    }
}

As always, test and reload the nginx config with nginx -t && nginx -s reload. This command will only reload the config if it doesn’t contain any errors.


Now let’s test if it works:

  1. The first anonymous visit will call the PHP code and produce the static .html file
  2. The second anonymous visit will receive the cached html

# First call to the website serves directly from wordpress
$ curl -s -D - http://www.foremka.at -o /dev/null
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 21 Feb 2016 12:12:42 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.5.9-1ubuntu4.14
Vary: Cookie
Link: <http://www.foremka.at/wp-json/>; rel="https://api.w.org/"
Link: <http://www.foremka.at/>; rel=shortlink
Strict-Transport-Security: max-age=15768000

# Second call to the website should serve the plain html
$ curl -s -D - http://www.foremka.at -o /dev/null
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 21 Feb 2016 12:13:48 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 7146
Last-Modified: Sun, 21 Feb 2016 12:12:42 GMT
Connection: keep-alive
Vary: Accept-Encoding
ETag: "56c9a9ba-1bea"
Strict-Transport-Security: max-age=15768000
Accept-Ranges: bytes

How do we know the second response didn’t go through PHP and WP Super Cache? Because WP Super Cache adds its own header when it serves a cached file from PHP, as you can see in this example with the nginx config disabled:

$ curl -s -D - http://www.foremka.at -o /dev/null
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 21 Feb 2016 12:15:12 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.5.9-1ubuntu4.14
Vary: Accept-Encoding, Cookie
Cache-Control: max-age=3, must-revalidate
WP-Super-Cache: Served supercache file from PHP  <-- WP Super Cache Header
Strict-Transport-Security: max-age=15768000
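The three responses above can be told apart by their headers alone. The following sketch (with a made-up function name, `served_by`) applies the same reasoning. Note that the missing X-Powered-By header is only a heuristic: it assumes that PHP is configured to send this header, as it is on the server in these examples.

```python
def served_by(headers):
    """Classify a response by its headers (names compared case-insensitively)."""
    names = {name.lower(): value for name, value in headers.items()}
    if "wp-super-cache" in names:
        # The plugin itself answered from within PHP
        return "php (wp super cache)"
    if "x-powered-by" in names:
        # Wordpress generated the page
        return "php"
    # No PHP involved: nginx served the static file directly
    return "static file"


first = {"Server": "nginx", "X-Powered-By": "PHP/5.5.9-1ubuntu4.14"}
second = {"Server": "nginx", "Content-Length": "7146", "ETag": '"56c9a9ba-1bea"'}
print(served_by(first))   # php
print(served_by(second))  # static file
```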

Troubleshooting

If you run into problems, you can always enable WP Super Cache debugging, which logs a lot of info to a logfile, and manually delete the /wp-content/cache directory for a fresh start.

You can easily verify that WP Super Cache is producing the static files by showing all files in the directory wp-content/cache, for instance with the command tree or find ./, or by opening the “Contents” tab in the WP Super Cache Settings and then clicking “Regenerate cache stats”.
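If you prefer a script over tree or find, the cached pages can also be listed with a few lines of Python. This is a minimal sketch: the function name `cached_pages` is made up, and `wp_root` is assumed to be the directory that contains wp-content (the nginx root in the config above).

```python
from pathlib import Path


def cached_pages(wp_root):
    """Return all cached .html files, relative to the supercache directory."""
    cache_dir = Path(wp_root) / "wp-content" / "cache" / "supercache"
    if not cache_dir.is_dir():
        return []  # nothing cached yet, or caching is not enabled
    return sorted(p.relative_to(cache_dir) for p in cache_dir.rglob("*.html"))
```

For example, `cached_pages("/var/www/foremka.at/htdocs")` would list one index.html (and index-https.html for https visits) per cached page, grouped by hostname.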


And that’s it! Enjoy your blazingly fast Wordpress setup 😎🚀

If you have suggestions or feedback, let me know via @metachris.

