PDFx update and new version release (v1.4.1)
PDFx is a tool to extract text, links, references and metadata from PDF files and URLs. Thanks to several contributors, the project received a thorough update and was brought into 2021. Today's new release is PDFx v1.4.1.
PDFx works like this:
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Pages = 13
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
- pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
- pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
- xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
- xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}
References: 36
- URL: 18
- PDF: 18
PDF References:
- http://www.spiegel.de/media/media-35533.pdf
- http://www.spiegel.de/media/media-35513.pdf
- http://www.spiegel.de/media/media-35509.pdf
- http://www.spiegel.de/media/media-35529.pdf
- http://www.spiegel.de/media/media-35527.pdf
- http://cr.yp.to/factorization/smoothparts-20040510.pdf
...
It's written in Python and can be used as a CLI and as a library. It includes a bunch of features: JSON output, checking links, downloading references, and more. Check out the GitHub project and introduction blog post.
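The CreationDate and ModDate fields in the sample output above use the PDF date format (D:YYYYMMDDHHmmSS followed by a UTC offset). If you want to work with those values in Python, here is a minimal sketch of a parser. It assumes the full-length form shown in the output, and parse_pdf_date is a hypothetical helper for illustration, not part of the pdfx API.

```python
import re
from datetime import datetime, timedelta, timezone

# Full-length PDF date: D:YYYYMMDDHHmmSS, optionally followed by Z or +HH'mm' / -HH'mm'.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Hypothetical helper: parse a full-length PDF date string into a datetime."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, off_h, off_m = groups[6:]
    if sign is None:
        tz = None  # no offset given: return a naive datetime
    elif sign == "Z":
        tz = timezone.utc
    else:
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20150821110623-04'00'")
print(created.isoformat())  # 2015-08-21T11:06:23-04:00
```

The result matches the CreateDate value that the same document reports in its xap metadata.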
What is in the update
- Python compatibility from 2.7 to 3.10-dev
- Switch to pdfminer.six, a community-maintained fork of pdfminer
- Linting & static checks with flake8, pylint and mypy
- Tests: pytest, coverage
- black for code formatting
- GitHub workflows:
  - Testing and linting (using Python versions 3.6 to 3.10): dashboard, lint-and-test.yml
  - Publishing a new version to PyPI: dashboard, publish-to-pypi.yml
- README.rst -> README.md (initial conversion from rst to markdown with pandoc)
- Detects PDF URLs that end with parameters (e.g. ?dl=1 on Dropbox)
- Include tests in PyPI tarball
- Partly inspired by How to make an awesome Python package in 2021
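To illustrate the URL detection item from the list above: a link such as paper.pdf?dl=1 does not end in .pdf as a plain string, so a naive suffix check misses it. One way to handle this is to look only at the path component of the URL. This is a rough sketch of the idea, not the code pdfx actually uses; looks_like_pdf_url is a made-up name and the example.com URL is a placeholder.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(url):
    """Hypothetical helper: True if the URL path ends in .pdf, ignoring query and fragment."""
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


# Placeholder URL with a download parameter, similar to Dropbox share links
print(looks_like_pdf_url("https://example.com/files/paper.pdf?dl=1"))  # True
print(looks_like_pdf_url("https://weakdh.org/imperfect-forward-secrecy.pdf"))  # True
print(looks_like_pdf_url("https://github.com/metachris/pdfx"))  # False
```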
Try it out
# Install pdfx
pip install -U pdfx
# Run it with a URL
pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
Options
$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf
Extract metadata and references from a PDF, and optionally download all referenced PDFs. Visit
https://www.metachris.com/pdfx for more information.
positional arguments:
pdf Filename or URL of a PDF file
optional arguments:
-h, --help show this help message and exit
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
Download all referenced PDFs into specified directory
-c, --check-links Check for broken links
-j, --json Output infos as JSON (instead of plain text)
-v, --verbose Print all references (instead of only PDFs)
-t, --text Only extract text (no metadata or references)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
Output to specified file instead of console
--version show program's version number and exit
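If you are curious how a command-line interface like the one above can be declared, here is a sketch using argparse from the standard library. It only mirrors the help text shown; the real pdfx source may be structured differently, and build_parser is a name chosen for this example.

```python
import argparse


def build_parser():
    """Sketch of a parser mirroring the help output above (not the actual pdfx source)."""
    parser = argparse.ArgumentParser(
        prog="pdfx",
        description="Extract metadata and references from a PDF, "
        "and optionally download all referenced PDFs.",
    )
    parser.add_argument("pdf", help="Filename or URL of a PDF file")
    parser.add_argument(
        "-d", "--download-pdfs", metavar="OUTPUT_DIRECTORY",
        help="Download all referenced PDFs into specified directory",
    )
    parser.add_argument("-c", "--check-links", action="store_true",
                        help="Check for broken links")
    parser.add_argument("-j", "--json", action="store_true",
                        help="Output infos as JSON (instead of plain text)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print all references (instead of only PDFs)")
    parser.add_argument("-t", "--text", action="store_true",
                        help="Only extract text (no metadata or references)")
    parser.add_argument("-o", "--output-file", metavar="OUTPUT_FILE",
                        help="Output to specified file instead of console")
    parser.add_argument("--version", action="version", version="%(prog)s 1.4.1")
    return parser


# Flags can be combined, e.g. JSON output plus link checking
args = build_parser().parse_args(
    ["-j", "-c", "https://weakdh.org/imperfect-forward-secrecy.pdf"]
)
print(args.json, args.check_links, args.pdf)
```

Parsing an explicit list instead of sys.argv keeps the example runnable from a script or a REPL.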
Links & References
- GitHub project: https://github.com/metachris/pdfx
- PyPI: https://pypi.org/project/pdfx/
Feedback and ideas are appreciated!
Please leave a comment below (or reach out via twitter.com/metachris).