Survey of web scraping tools: which is the best web scraper in the land?
“Web scraping” (aka “screen scraping”, or just “scraping”) is the act of obtaining data from pages and other sources on the Internet using tools. Sometimes the process is automated (e.g. a bot or spider); in other cases a human is involved. We could simply be trying to retrieve specific parts of a page or data set to include in a larger query or data analysis, and scraping tools are our means of getting the data in the format we need.
Of course, we’re biased at CodeX: we believe xSkrape has the best web scraping features compared to many others. We’re happy to compare various competitors head-to-head to help you understand exactly what you’re getting when you use xSkrape.
| Web Scraping Tool | Feature Area | Commentary |
| --- | --- | --- |
| xSkrape | Access | Available from an Excel add-in, or through calls made directly to our published REST API. Similar functionality is available in xSkrape for SQL Server, SQL-Hero and CodeX SSIS Components. Future: a Google Docs add-on as well. |
| | Pricing | The “Free Tier” offers use of any sub-feature of the product up to a monthly limit, after which you can purchase additional usage credits, which generally have an age limit. If you use the product occasionally, it’s fine to let credits expire and receive / purchase more only as needed for any workload that exceeds the free tier. |
| | Versioning / Licensing | The latest version is always available in the web environment (API and Excel add-in installed via the Office Store). The version that can be installed locally (executable used to install) is licensed per version / machine. The XS.QL language is evolving continually in a way that will not break existing queries as we move forward. xSkrape is more than just a web scraper: all features are shared in a common tool-box. |
| | Features | Combine data from multiple requests into a single data set; query data sets using SQL-like expressions; interpret JSON and XML “as if tabular”; avoid using regular expressions or XPath for many kinds of text access: use a simpler set of expressions that we document thoroughly; a “component-based architecture”: this means we can offer comparable functionality (e.g. the XS.QL language) in different tools, be it Excel, SQL Server, an API, etc. |
| Curl | Access | Command-line utility and library. |
| | Pricing | Free. Open source. |
| | Versioning / Licensing | MIT derivative license. Source available on GitHub. |
| | Features | Plenty of low-level data access considerations are covered; suited for those doing development work in C; no inherent “data recognition features” (e.g. recognition of tabular data is not its intent); distance to land your data in a clean form in a place like Excel: “very far”. |
| YQL | Access | REST API. |
| | Pricing | Free. |
| | Versioning / Licensing | Available for both commercial and non-commercial use, based on Terms of Use. Rate limited. |
| | Features | Query language looks like SQL; aggregate over multiple feeds; query into different formats (e.g. XML, HTML); distance to land your data in a clean form in a place like Excel: “pretty far” (break out the VBA!). |
| Excel (Web Data Source) | Access | Included as part of Excel. |
| | Pricing | Included as part of Excel. |
| | Versioning / Licensing | Included as part of Excel. |
| | Features | Can identify tabular data within web pages (visual identification / selection); your ability to get data of interest depends on Excel’s ability to parse it, which does not work well with some kinds of data (“many” in our experience!); can parameterize and do some operations using VBA; different refresh options supported. |
As you can tell, many of today’s popular web scraping tools are geared towards programmers (whom xSkrape also supports with its public API). However, a key goal of xSkrape is to provide easy access to data without having to write any code.
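To give a sense of the code that programmer-oriented tools tend to leave to you, here is a minimal Python sketch that pulls the rows out of an HTML table using only the standard library. It is an illustration under our own assumptions, not a description of any tool in the table: the `TableRows` class, the `extract_rows` helper and the sample markup are all made up for this example, and a real page would also need fetching, error handling and support for nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table rows found in an HTML string (hypothetical helper)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up page fragment, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

Even this small case takes a few dozen lines before the data is anywhere near a spreadsheet, which is the “distance” the table above refers to.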
What does xSkrape offer in its user interface that other web scrapers don’t? Here’s an example of a data extraction task that pulls in data from an RSS feed:
The source RSS feed looks like this:
Notice: zero coding was required to get a nice, flattened, tabular list of items! Zero mapping effort, too! This is a “default behavior” that can be tweaked using XS.QL. The point is that many scraping tools don’t offer this kind of a) up-front intelligence, b) ability to land data nicely in a tool like Excel (soon to include Google Sheets), or c) ability to describe what you want (versus coding, which is explaining how to get it).
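For readers who want to picture what a “flattened, tabular list” of feed items means, here is a minimal Python sketch of the general idea: one row per `<item>`, one column per child element. It is not how xSkrape works internally. The feed content and the `rss_to_rows` function are made up for illustration, and the sketch assumes a plain RSS 2.0 document without namespaces.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document, for illustration only.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/1</link>
      <pubDate>Mon, 01 Aug 2016 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def rss_to_rows(rss_text):
    """Return (columns, rows): one row per <item>, one column per child tag."""
    root = ET.fromstring(rss_text)
    items = root.findall("./channel/item")

    # Column order follows the first appearance of each child tag.
    columns = []
    for item in items:
        for child in item:
            if child.tag not in columns:
                columns.append(child.tag)

    # Items that lack a given child get an empty string in that column.
    rows = [
        [(item.findtext(col) or "").strip() for col in columns]
        for item in items
    ]
    return columns, rows


if __name__ == "__main__":
    columns, rows = rss_to_rows(SAMPLE_RSS)
    print(columns)
    for row in rows:
        print(row)
```

The column list is discovered from the data rather than declared up front, which is the “zero mapping” part of the idea; everything else about the real feature is outside the scope of this sketch.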
This approach works equally well for JSON sources. For example, a JSON source that looks like this:
Becomes:
The ability to perform this zero-effort loading as a default when no other tabular data can be identified is a relatively new feature (8/10/16), which also demonstrates how we’re constantly improving.
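The same idea can be sketched for JSON. The Python below is only one plausible way to treat a JSON document “as if tabular”, under assumptions of our own: it takes the first list of objects it finds as the record set and turns nested objects into dotted column names. The sample document and the `find_records`, `flatten` and `json_to_rows` names are hypothetical, and this is not a description of xSkrape’s actual parsing rules.

```python
import json

# Made-up JSON document, for illustration only.
SAMPLE_JSON = """
{
  "store": "Example",
  "products": [
    {"name": "Widget", "price": {"amount": 9.99, "currency": "USD"}},
    {"name": "Gadget", "price": {"amount": 24.5, "currency": "USD"}}
  ]
}
"""


def find_records(doc):
    """Return the first non-empty list of dicts found depth-first, or []."""
    if isinstance(doc, list) and doc and all(isinstance(x, dict) for x in doc):
        return doc
    if isinstance(doc, dict):
        children = list(doc.values())
    elif isinstance(doc, list):
        children = doc
    else:
        children = []
    for child in children:
        found = find_records(child)
        if found:
            return found
    return []


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat


def json_to_rows(text):
    """Parse JSON text and return one flat dict per record."""
    return [flatten(record) for record in find_records(json.loads(text))]


if __name__ == "__main__":
    for row in json_to_rows(SAMPLE_JSON):
        print(row)
```

Picking “the first list of objects” is a deliberately simple default; choosing a different part of the document is the kind of override that a query language is for.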
The xSkrape samples gallery includes examples that show how we can be more precise and override the default parsing behavior shown above, to extract specific pieces of XML and JSON documents using techniques such as join sets.
Please note that scraping with any scraping tool is governed by the Terms of Use of the web sites being “scraped”. We discuss this at more length on our website.
Our future plans for xSkrape include:
- Generate alerts based on XS.QL queries that we can run on your behalf. What if your competitor’s price changes? Get an alert. Or a certain measure you calculate based on what’s published on three different web sites changes by more than 10%? Get an alert. We’ll be leveraging our messaging tools to make this happen, along with potential mobile apps.
- Continue to evolve XS.QL to solve new problems. Tell us your problem: we bet we can solve it!
- Of course, xSkrape isn’t just a web scraper: it has messaging, security and other data features that make it an all-purpose tool.
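The “changes by more than 10%” rule in the first plan above is simple arithmetic, and a short Python sketch shows what such a check could look like. Since the alert feature is a future plan, nothing here reflects an actual xSkrape interface: `percent_change`, `should_alert` and the default threshold are hypothetical names and values chosen for the example.

```python
def percent_change(old, new):
    """Percentage change from old to new, relative to the old value."""
    if old == 0:
        raise ValueError("percent change is undefined when the old value is 0")
    return (new - old) * 100.0 / abs(old)


def should_alert(old, new, threshold_pct=10.0):
    """True when the value moved by more than threshold_pct in either direction."""
    return abs(percent_change(old, new)) > threshold_pct


if __name__ == "__main__":
    # Made-up readings of a scraped measure, for illustration only.
    print(should_alert(100.0, 105.0))
    print(should_alert(100.0, 111.0))
    print(should_alert(100.0, 89.0))
```

The check is symmetric, so a drop triggers an alert just as a rise does, and a move of exactly 10% does not count as “more than 10%”.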
We invite you to evaluate our web scraping tools both inside and outside of Excel. You can use them for free today as long as your usage falls within our free usage tier. If you grow beyond that, we’re confident the cost will be significantly lower than a “build it yourself” solution. If you have any questions at all, don’t hesitate to ask us! We respond to every submission.