Most of my readers know that in addition to being a runner, I’m trained as an oceanographer and have been working in that field since 1998. Since I don’t blog very often anymore, most of my readers probably don’t know that I left the field in October and started working as a software engineer at Accusoft. I’ve had a lot of fun working with the SaaS team developing PrizmShare, a fast and easy way to share documents of all types in the cloud. In my short tenure at Accusoft, I’ve played a large part in developing the site search feature of PrizmShare. This involves extracting the text from documents of all types and indexing it so that it can later be searched.
Some of the documents shared on PrizmShare can be quite large and contain 100MB or more of text. For so many reasons, I believe all of this text should be easily searchable (unless the owner wishes to keep it private). In the near future, we’ll likely expose all of that text to our internal search engine. While thinking through that project, I asked myself (and the team) how much we should expose to external search engines like Google, Yahoo! and Bing.
In an ideal world, the answer would be "all of it," but exposing hundreds of megabytes of plain text on every visit to a document’s page would not provide an ideal experience to a user or a search engine bot. It could also get very costly for the company.
We can easily deal with the user experience by dynamically loading more text as the user scrolls down the page. This is how things already work within our document viewer. The plain text is for those who use text-to-speech software or other technology that requires plain text rather than an image of the document.
The plain text is also what gets processed by search engine spiders, but these automated programs aren’t so good at acting like humans. They don’t trigger events (clicks, scrolling, etc.) on a page like humans do. So, they can only be relied upon to read whatever text is originally delivered with the page. Where is the tipping point between providing maximum text to the search engine spiders and losing the optimal user experience?
Search engines have their own performance issues. They are keen to keep their own user experience optimal, so it makes sense that their automated spiders would probably have a limit to the amount of data they retrieve for a given page. What is this limit? That’s a little more difficult to find out. With the exception of an experiment from 2006 and some speculation, I’ve been unable to find anything about the topic. So, I decided to conduct my own experiment.
I chose three domains that I own but am not currently using. Each domain received an index file that points to 5 other files. Each of these files contains the entire text of Alice in Wonderland, followed by a series of random words. The files are 250KB, 500KB, 1000KB, 2000KB, and 5000KB in size. Within the series of random words, a unique set of 10 “gibberish” characters was placed every 10KB. No gibberish sequence appears more than once in a given document, and no gibberish sequence appears in more than one document.
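To make the setup concrete, here is a rough sketch of how such test files could be generated. This is my own illustration, not the actual script used for the experiment: the function names, the marker alphabet, and the filler-word generator are all assumptions, and the Alice in Wonderland preamble is omitted.

```python
import random
import string

def make_marker(rng, length=10):
    """One 10-character 'gibberish' marker of random lowercase letters."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def random_words(rng, n_bytes):
    """Roughly n_bytes of space-separated random filler words."""
    out, size = [], 0
    while size < n_bytes:
        word = "".join(rng.choice(string.ascii_lowercase)
                       for _ in range(rng.randint(3, 9)))
        out.append(word)
        size += len(word) + 1  # +1 for the joining space
    return " ".join(out)

def build_test_file(size_kb, rng, used_markers, chunk_kb=10):
    """Filler text with a unique marker embedded every chunk_kb kilobytes.

    used_markers is shared across all files so that no marker appears
    in more than one document.
    """
    parts = []
    for _ in range(size_kb // chunk_kb):
        marker = make_marker(rng)
        while marker in used_markers:  # enforce global uniqueness
            marker = make_marker(rng)
        used_markers.add(marker)
        # Fill the chunk, leaving room for the marker and separators.
        parts.append(random_words(rng, chunk_kb * 1024 - len(marker) - 2))
        parts.append(marker)
    return " ".join(parts)
```

Because filler words are at most 9 letters and markers are 10, a marker can never appear by accident in the surrounding text, so each one can only match if the search engine actually indexed that chunk.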
Once the files are indexed by the major search engines, I should be able to ascertain how much of each file was indexed by performing a domain search for each gibberish sequence. I’ll post the results when I’ve got them. In the meantime, you can check out the experiment at the following three domains:
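The measurement step could be sketched as follows. Since the markers sit at known 10KB intervals, the deepest marker a given engine returns is a lower bound on how many bytes of the file it indexed. The helper names and the `site:` query format are my own assumptions; the exact operator syntax varies by search engine.

```python
def site_query(domain, marker):
    """Build a domain-restricted query for one marker
    (hypothetical; exact operator syntax varies by engine)."""
    return f'site:{domain} "{marker}"'

def indexed_depth_kb(found_flags, chunk_kb=10):
    """Markers sit at chunk_kb, 2*chunk_kb, ... bytes into the file.

    found_flags[i] is True if the (i+1)-th marker was returned by the
    engine; the deepest hit gives a lower bound on indexed depth in KB.
    """
    deepest = 0
    for i, found in enumerate(found_flags, start=1):
        if found:
            deepest = i * chunk_kb
    return deepest
```

For example, if searches for the first two markers of a file return results but the rest do not, `indexed_depth_kb([True, True, False, ...])` reports at least 20KB indexed.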