May 2010 Linkscape Update (and Whiteboard Explanations of How We Do It)
Posted by randfish
As some of you likely noticed, Linkscape’s index updated today with fresh data crawled over the past 30 days. Rather than simply provide the usual index update statistics, we thought it would be fun to do some whiteboard diagrams of how we make a Linkscape update happen here at the mozplex. We also felt guilty because our camera ate tonight’s WB Friday (but Scott’s working hard to get it up for tomorrow morning).
Linkscape, like most of the major web indices, starts with a seed set of trusted sites from which we crawl outwards to build our index. Over time, we’ve developed more sophisticated methods around crawl selection, but we’re quite similar to Google in that we crawl the web primarily in descending order of (in our case) mozRank importance.
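For the curious, here’s a minimal sketch of that crawl-ordering idea: a frontier kept in a max-heap keyed on an importance score, so the highest-mozRank URL we know about is always fetched next. This is not our production crawler; the `fetch`, `extract_links` and `importance` helpers are hypothetical stand-ins.

```python
import heapq

def crawl(seed_urls, importance, fetch, extract_links, budget=1000):
    """Crawl outward from a trusted seed set, always expanding the
    most important known URL first. Python's heapq is a min-heap,
    so scores are negated to pop the largest one."""
    frontier = [(-importance(url), url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        page = fetch(url)                 # hypothetical HTTP fetch
        budget -= 1
        for link in extract_links(page):  # hypothetical link parser
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-importance(link), link))
```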
For those keeping track, this index’s raw data includes:
- 41,404,250,804 unique URLs/pages
- 86,691,236 unique root domains
After crawling, we need to build indices on which we can process data, compute metrics and generate sort orders for our API to access.
When we started building Linkscape in late 2007 and early 2008, we quickly realized that the quantity of data would overwhelm nearly every commercial database on the market. Something massive like Oracle might have been able to handle the volume, but at an exorbitant price that a startup like SEOmoz couldn’t bear. Thus, we created some unique, internal systems around flat file storage that enable us to hold data, process it and serve it without the financial and engineering burdens of a full database application.
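To illustrate the flat-file idea (a rough sketch under an assumed record format, not our actual storage engine): because the index is rebuilt in batch each month, records can be sorted once at build time and then served with a simple binary search over fixed-width rows, with no database process in the loop.

```python
import struct

# Assumed toy record layout: 8-byte URL ID + 8-byte metric value.
RECORD = struct.Struct(">Qd")

def write_index(path, records):
    """Write (url_id, score) pairs sorted by url_id as fixed-width rows."""
    with open(path, "wb") as f:
        for url_id, score in sorted(records):
            f.write(RECORD.pack(url_id, score))

def lookup(path, url_id):
    """Binary-search the sorted flat file: O(log n) seeks, no database."""
    with open(path, "rb") as f:
        f.seek(0, 2)  # seek to end to measure the file
        lo, hi = 0, f.tell() // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)
            key, score = RECORD.unpack(f.read(RECORD.size))
            if key == url_id:
                return score
            if key < url_id:
                lo = mid + 1
            else:
                hi = mid
    return None
```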
Our next step, once the index is in place, is to calculate our key metrics and tabulate the standard sort orders for the API.
Algorithms like PageRank (and mozRank) are iterative and require a tremendous amount of processing power to compute. We’re able to do this in the cloud, scaling up our need for number-crunching, mozRank-calculating goodness for about a week out of every month, but we’re pretty convinced that in Google’s early days, this was likely a big barrier (and may even have been a big part of the reason the "GoogleDance" only happened once every 30 days).
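For intuition about why that computation is so expensive, here’s a toy power-iteration PageRank over a tiny link graph. mozRank’s actual formula and our distributed, cloud-scale implementation differ, but the iterative shape is the same:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over a {page: [outlinks]} dict.
    Each pass redistributes every page's score across its outlinks,
    and every pass touches the entire link graph -- which is why
    web-scale versions need serious number-crunching capacity.
    (Dangling pages leak rank here; it's a toy.)"""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for out in outs:
                    new[out] += share
        rank = new
    return rank

# e.g. pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

At our index’s scale, each of those passes walks the full link graph, which is why we rent that capacity in the cloud for about a week a month rather than keeping it around year-round.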
After processing, we’re ready to push our data out into the SEOmoz API, where it can power our tools and those of our many partners, friends and community members.
The API currently serves more than 2 million requests for data each day (and an average request pulls ~10 metrics/pieces of data about a web page or site). That’s a lot, but our goal is to more than triple that quantity by 2011, at which point we’ll be closer to the request numbers going into a service like Yahoo! Site Explorer.
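To give a feel for one of those requests, here’s a hedged sketch of pulling a bundle of metrics for a single URL in one call. The endpoint, parameter names and response fields below are illustrative assumptions, not the documented interface (and the real API requires authentication); check the API docs before building against it.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint -- consult the SEOmoz API docs for the real
# interface and its authentication scheme.
API = "http://lsapi.example.com/linkscape/url-metrics/"

def url_metrics(target_url):
    """Fetch ~10 link metrics for one page or site in a single request."""
    quoted = urllib.parse.quote(target_url, safe="")
    with urllib.request.urlopen(API + quoted) as resp:
        return json.loads(resp.read())

metrics = url_metrics("www.seomoz.org")
print(metrics.get("mozRank"), metrics.get("links"))  # assumed field names
```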
The SEOmoz API currently powers some very cool stuff:
- Open Site Explorer – my personal favorite way to get link information
- The mozBar – the SERP overlay, the Analyze Page feature and the link metrics displayed directly in the bar all come from the API
- Classic Linkscape – we’re on our way to transitioning all of the features and functionality in Linkscape over to OSE, but in the meantime, PRO members can get access to many more granular metrics through these reports
- Dozens of External Applications – things like Carter Cole’s Google Chrome toolbar, several tools from Virante’s suite, Website Grader and lots more (we have an application gallery coming soon)
Each month, we repeat this process, learning big and small lessons along the way. We’ve gotten tremendously more consistent, redundant and error/problem-free in 2010 so far, and our next big goals are to dramatically increase the depth of our crawl into those dark crevices of the web and to ramp up the value and accuracy of our metrics.
We look forward to your feedback around this latest index update and any of the tools powered by Linkscape. Have a great Memorial Day Weekend!