MarTech Hacks: When Google Doesn’t Like Your PDFs

Google has been able to index PDFs and show them in search results since 2001. Yet there is a persistent belief in the online community that Google doesn’t really look at PDFs, usually because “my” PDFs don’t show up in search results. Why does Google like some PDFs and not others? In my experience there are two main causes:

1) There is no text to index. This happens most often when a brochure or some other printed material is scanned page by page as images, and those images are then stitched together into something resembling the original document. The document contains no text, just images of text, so there is nothing to index.

The hack: It’s best to go back to the creator of the original document and obtain an electronic copy that you can convert to PDF, or have them convert it. If you can’t do that for some reason, many current scanners will perform Optical Character Recognition (OCR) on your document, which converts the printed page to digital text. This isn’t perfect. Even the best OCR engines have problems with tiny fonts, big serifs, variable kerning and the like, so after scanning you will need to check the text very carefully for errors introduced by the scan.
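If you are not sure whether a PDF you already have contains real text or only page images, a rough check is possible with nothing but the Python standard library. The sketch below is a heuristic, not a PDF parser: it assumes the page content is stored uncompressed or with the common Flate (zlib) compression, and it simply looks for the PDF text-painting operators Tj and TJ. The function name looks_like_it_has_text is made up for this illustration.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Tj and TJ are the PDF operators that paint text on a page. A page built
# only from scanned images normally contains neither of them.
TEXT_OP_RE = re.compile(rb"[)>]\s*Tj|\]\s*TJ")


def looks_like_it_has_text(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream contains a text-painting operator."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the line break that follows the data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data: either uncompressed content or a filter this
            # sketch does not handle, so search the raw bytes instead.
            data = raw
        if TEXT_OP_RE.search(data):
            return True
    return False
```

To try it, read a file in binary mode, for example `looks_like_it_has_text(open("brochure.pdf", "rb").read())`, where brochure.pdf is a placeholder name. A result of False suggests a scan with no text layer. Because the check can be fooled by unusual encodings or by binary image data that happens to contain the same bytes, treat it as a first pass and confirm by trying to select or search text in a PDF viewer.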

If you are able to obtain a copy of the original layout and convert that to PDF, you may still have problems.

2) The file is too big. As a rule, Google will not index huge files, including PDFs. Large PDFs happen a lot with Adobe Photoshop, because Photoshop can produce an editable PDF, that is, a PDF that can be pulled into another program for editing (this was one of the original purposes for Portable Document Format). That means Photoshop preserves the layers, channel data, masks, etc., which can make for a VERY large PDF. In older versions of Photoshop, the box “Preserve Photoshop Editing Capabilities” was turned on by default, and people didn’t know that they needed to uncheck it for documents that were going to be available for download. The other issue has to do with images. Images are embedded in PDF documents, unlike web pages where the images are downloaded separately, so the search bot has to download them with the document, even though the bot won’t index them. If these images are left at print resolution (144 or 300 dpi) as opposed to screen resolution (72 dpi), they can greatly and unnecessarily inflate the file size.
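To see why print resolution matters so much, it helps to count pixels. The short sketch below assumes a hypothetical full-page image on US Letter paper (8.5 x 11 inches) and compares the resolutions mentioned above. Pixel count is only a proxy for file size, since compression changes the final number, but the proportions are telling.

```python
def pixel_count(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in an image of the given printed size and resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


# Hypothetical full-page image on US Letter paper (8.5 x 11 inches).
baseline = pixel_count(8.5, 11, 72)
for dpi in (72, 144, 300):
    pixels = pixel_count(8.5, 11, dpi)
    ratio = pixels / baseline
    print(f"{dpi:>3} dpi: {pixels:>9,} pixels ({ratio:.1f}x the 72 dpi version)")
```

Because resolution applies in both directions, doubling the dpi quadruples the pixels. The 300 dpi page carries 8,415,000 pixels against 484,704 at 72 dpi, roughly 17 times as much image data for the bot and your readers to download.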

The hack: During conversion, make sure the “Preserve Photoshop Editing Capabilities” box is unchecked. Go into the compression settings of your conversion program and ensure that any images over 72 dpi are downsampled to 72 dpi during the conversion process. Check the size of your PDF. If it is still over 2 megabytes, you still have some work to do. It might be time to lose the beautiful front-page cover image, or the image that appears in the footer of each page. When you are going online, every image that does not provide content needs scrutiny for its value. Remember, this isn’t just for Google. Today your customers are on mobile devices. They won’t appreciate a giant download.
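If you publish more than a handful of PDFs, checking sizes by hand gets tedious. As a minimal sketch, assuming your PDFs sit together in one folder, the following lists every file above the 2 megabyte ceiling suggested above. The function name oversized_pdfs and the constant SIZE_LIMIT_BYTES are invented for this example, and the limit is this article's rule of thumb, not a number published by Google.

```python
from pathlib import Path

SIZE_LIMIT_BYTES = 2 * 1024 * 1024  # the 2 megabyte rule of thumb


def oversized_pdfs(folder: str, limit: int = SIZE_LIMIT_BYTES) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for each PDF in folder above the limit."""
    results = []
    # Note: "*.pdf" is case-sensitive on most non-Windows systems.
    for path in sorted(Path(folder).glob("*.pdf")):
        size = path.stat().st_size
        if size > limit:
            results.append((path.name, size))
    return results
```

Calling `oversized_pdfs("downloads")`, with downloads standing in for your own folder, returns an empty list when everything is within budget. Anything it does return is a candidate for another pass through the compression settings.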

There is an ongoing debate over whether you should avoid PDFs and express all of your content as HTML pages. PDFs are great for manuals and other complex documents that will probably be downloaded, so you should use them appropriately. Just be careful what you publish online, follow a few simple rules, and Google will love your PDFs.

Oinkodomeo provides sales consulting and sales enablement process improvement, training and coaching built on a foundation of sales and marketing alignment. We work with and evaluate emerging sales enablement tools and content management solutions.