Chris Hager
Programming, Technology & More

PDFx v1.0 - Extract metadata and URLs from PDFs, and download all referenced PDFs

I just released PDFx version 1.0, a Python tool and library to extract metadata and URLs from PDFs, and to automatically download all referenced PDFs. The project is released under the Apache license, and the source code is on GitHub!

Features

  • Extract metadata and PDF URLs from a given PDF (file or URL)
  • Download all PDFs referenced in the original PDF
  • Works with local and online PDFs
  • Use as command-line tool or Python package
  • Compatible with Python 2 and 3

Quick Start

Grab a copy of pdfx with easy_install or pip and run it:

$ easy_install -U pdfx
...
$ pdfx <pdf-file-or-url>

Run pdfx -h to see the help output:

$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-j] [-v] [--debug] [--version] pdf

Get infos and links from a PDF, and optionally download all referenced PDFs.
See http://www.metachris.com/pdfx for more information.

positional arguments:
  pdf                   Filename or URL of a PDF file

optional arguments:
  -h, --help            show this help message and exit
  -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        Download all referenced PDFs into specified directory
  -j, --json            Output infos as json (instead of plain text)
  -v, --verbose         Print all urls (instead of only PDF urls)
  --debug               Output debug infos
  --version             show program's version number and exit
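
The -v flag above distinguishes "PDF URLs" from "all URLs". As a rough illustration of that distinction, here is a minimal sketch using only the Python standard library. It is not pdfx's actual implementation: the regular expression, the helper names (find_urls, is_pdf_url, filter_urls) and the rule "the URL path ends in .pdf" are all assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Assumption: a deliberately simple pattern for http(s) URLs in plain text.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def find_urls(text):
    """Return all http(s) URLs in text, in order, without duplicates."""
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls


def is_pdf_url(url):
    """Assumption: a URL points to a PDF if its path ends in '.pdf'."""
    return urlparse(url).path.lower().endswith(".pdf")


def filter_urls(text, verbose=False):
    """Return only PDF URLs by default, or every URL when verbose is set."""
    urls = find_urls(text)
    return urls if verbose else [u for u in urls if is_pdf_url(u)]


sample = (
    "See http://cr.yp.to/factorization/smoothparts-20040510.pdf, "
    "and the project page at https://weakdh.org/."
)
print(filter_urls(sample))
print(filter_urls(sample, verbose=True))
```

With verbose=False the sample yields only the smoothparts PDF link; with verbose=True the weakdh.org page is included as well.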

By default pdfx only prints the information. If you add the -d flag, it downloads all referenced PDFs to the specified location:

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d ./
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Pages = 13
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False

17 PDF URLs:
- http://cr.yp.to/factorization/smoothparts-20040510.pdf
- http://www.spiegel.de/media/media-35671.pdf
- http://www.spiegel.de/media/media-35529.pdf
- http://cryptome.org/2013/08/spy-budget-fy13.pdf
- http://www.spiegel.de/media/media-35514.pdf
- http://www.spiegel.de/media/media-35509.pdf
- http://www.spiegel.de/media/media-35515.pdf
- http://www.spiegel.de/media/media-35533.pdf
- http://www.spiegel.de/media/media-35519.pdf
- http://www.spiegel.de/media/media-35522.pdf
- http://www.spiegel.de/media/media-35513.pdf
- http://www.spiegel.de/media/media-35528.pdf
- http://www.spiegel.de/media/media-35526.pdf
- http://www.spiegel.de/media/media-35517.pdf
- http://www.spiegel.de/media/media-35527.pdf
- http://www.spiegel.de/media/media-35520.pdf
- http://www.spiegel.de/media/media-35551.pdf

Downloading 17 pdfs to './'...
All done!
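
The CreationDate and ModDate values in the output above use the PDF date string format (D:YYYYMMDDHHmmSS followed by a timezone offset). If you want to work with them in Python, a small parser along these lines may help. This is a sketch, not part of pdfx: parse_pdf_date is a hypothetical helper, and it assumes the full form with seconds, as seen in the sample output. Shorter variants that the format also allows are rejected.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: only the full "D:YYYYMMDDHHmmSS" form is handled, optionally
# followed by "Z" or an offset such as -04'00'.
PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})"
    r"(?:(?P<utc>Z)|(?P<sign>[+-])(?P<tz_hour>\d{2})'(?P<tz_minute>\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    match = PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date string: %r" % value)
    parts = match.groupdict()

    tz = None  # no timezone information: return a naive datetime
    if parts["utc"]:
        tz = timezone.utc
    elif parts["sign"]:
        offset = timedelta(
            hours=int(parts["tz_hour"]), minutes=int(parts["tz_minute"])
        )
        tz = timezone(-offset if parts["sign"] == "-" else offset)

    return datetime(
        int(parts["year"]), int(parts["month"]), int(parts["day"]),
        int(parts["hour"]), int(parts["minute"]), int(parts["second"]),
        tzinfo=tz,
    )


created = parse_pdf_date("D:20150821110623-04'00'")
modified = parse_pdf_date("D:20150821110805-04'00'")
print(created.isoformat())   # 2015-08-21T11:06:23-04:00
print(modified - created)    # 0:01:42
```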

To do

  • https://github.com/metachris/pdfx/issues

Reach out to me on Twitter @metachris



How to install NodeJS 4.x (LTS) on CentOS

NodeJS v4.x is deprecated. Take a look at the current instructions on how to install NodeJS 6.x Long-Term Support (LTS) and NodeJS 7.x.


Recently io.js and node.js merged again into a single codebase in Node v4.0.0. The fork happened in December 2014 and io.js has seen rapid improvements and fast uptake of upstream V8 features.

Now the merge has happened, and with it come a lot of features from the io.js codebase. NodeJS 4.2.0 was released on October 12 and is labelled the first Long Term Support (LTS) release of NodeJS. This means that NodeJS v4.x is going to be officially supported without backward incompatible updates for 30 months, until June 2017.

Sadly, the standard instructions on the downloads page for installing a current version of node with yum have not been updated for the new NodeJS v4.x releases!

So here is a quick and easy way to install the current NodeJS 4.x LTS (including npm) on CentOS from the official RPM repository.

# Install the repository
rpm -Uvh https://rpm.nodesource.com/pub_4.x/el/7/x86_64/nodesource-release-el7-1.noarch.rpm

# Install NodeJS
yum install nodejs

Enjoy NodeJS 4.x LTS on CentOS, and reach out to me on Twitter @metachris


Retrofit 2.0 Samples

This post is about using Retrofit 2.0 (beta) to consume HTTP based APIs.

Retrofit is a great and popular API client library for Java (and by extension also for Android) developed by Square.

Source code with samples for this post is available on GitHub.

Retrofit makes it easy to develop API clients by describing API endpoints and the results like this:

class Profile {
    String username;
    String email;
}

...

@GET("/profile")
Call<Profile> getProfile();

A great endpoint to test API calls is httpbin.org, a website and API that returns various information about the request and more.

Getting Started

First of all, we need to include the Retrofit library in a project. Using Gradle, this is accomplished by adding the following dependencies to build.gradle:

compile 'com.squareup.retrofit:retrofit:2.0.0-beta2'
compile 'com.squareup.retrofit:converter-gson:2.0.0-beta2'

The first dependency includes Retrofit itself; the second includes the Gson converter library for (de-)serialization of JSON objects. Retrofit supports converters for a number of formats, such as JSON, XML, Protocol Buffers and more.

The Code

This is the sample code for a number of httpbin.org API requests with GET and POST:

import retrofit.Call;
import retrofit.Callback;
import retrofit.GsonConverterFactory;
import retrofit.Response;
import retrofit.Retrofit;
import retrofit.http.*;

import java.io.IOException;
import java.util.Map;

public class HttpApi {

    public static final String API_URL = "http://httpbin.org";

    /**
     * Generic HttpBin.org Response Container
     */
    static class HttpBinResponse {
        // the request url
        String url;

        // the requester ip
        String origin;

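        // all headers that have been sent
        Map<String, String> headers;

        // url arguments
        Map<String, Object> args;

        // post form parameters
        Map<String, Object> form;

        // post body json (null if the request body was not JSON)
        Map<String, Object> json;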
    }

    /**
     * Exemplary login data sent as JSON
     */
    static class LoginData {
        String username;
        String password;

        public LoginData(String username, String password) {
            this.username = username;
            this.password = password;
        }
    }

    /**
     * HttpBin.org service definition
     */
    public interface HttpBinService {
        @GET("/get")
        Call<HttpBinResponse> get();

        // request /get?testArg=...
        @GET("/get")
        Call<HttpBinResponse> getWithArg(
            @Query("testArg") String arg
        );

        // POST form encoded with form field params
        @FormUrlEncoded
        @POST("/post")
        Call<HttpBinResponse> postWithFormParams(
            @Field("field1") String field1
        );

        // POST with a JSON body
        @POST("/post")
        Call<HttpBinResponse> postWithJson(
            @Body LoginData loginData
        );
    }

    public static void testApiRequest() {
        // Retrofit setup
        Retrofit retrofit = new Retrofit.Builder()
                .baseUrl(API_URL)
                .addConverterFactory(GsonConverterFactory.create())
                .build();

        // Service setup
        HttpBinService service = retrofit.create(HttpBinService.class);

        // Prepare the HTTP request
        Call<HttpBinResponse> call = service.postWithJson(new LoginData("username", "secret"));

        // Asynchronously execute HTTP request
        call.enqueue(new Callback<HttpBinResponse>() {
            /**
             * onResponse is called when any kind of response has been received.
             */
            @Override
            public void onResponse(Response<HttpBinResponse> response, Retrofit retrofit) {
                // http response status code + headers
                System.out.println("Response status code: " + response.code());

                // isSuccess is true if response code >= 200 and < 300
                if (!response.isSuccess()) {
                    // print response body if unsuccessful
                    try {
                        System.out.println(response.errorBody().string());
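                    } catch (IOException e) {
                        // reading the error body failed; report it instead of ignoring it
                        System.out.println("Could not read error body: " + e.getMessage());
                    }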
                    return;
                }

                // if parsing the JSON body failed, `response.body()` returns null
                HttpBinResponse decodedResponse = response.body();
                if (decodedResponse == null) return;

                // at this point the JSON body has been successfully parsed
                System.out.println("Response (contains request infos):");
                System.out.println("- url:         " + decodedResponse.url);
                System.out.println("- ip:          " + decodedResponse.origin);
                System.out.println("- headers:     " + decodedResponse.headers);
                System.out.println("- args:        " + decodedResponse.args);
                System.out.println("- form params: " + decodedResponse.form);
                System.out.println("- json params: " + decodedResponse.json);
            }

            /**
             * onFailure gets called when the HTTP request didn't get through.
             * For instance if the URL is invalid / host not reachable
             */
            @Override
            public void onFailure(Throwable t) {
                System.out.println("onFailure");
                System.out.println(t.getMessage());
            }
        });
    }
}

You can find an IntelliJ project with the full source code at github.com/metachris/retrofit2-samples.

Feedback, suggestions and pull requests are welcome!
