Machine-driven search matters more than ever
Customer-facing content sets, once hidden deep on websites and locked into print and PDFs, are now instantly searchable and discoverable. Helping Google and Bing, and our internal search engines, to find and deliver relevant content has become a key role of publishers across marketing and technical documentation.
Historical perspectives on structured data
One of the primary ways in which digitally-rendered content on the web differs from its printed equivalent is the ability of machines to ingest, analyze and index that web-based content. This in turn allows machines to return relevant content items in response to a search query, or otherwise surface interesting or useful content based on a user’s interests and needs.
In their efforts to understand what any given piece of digitally-provided content is about, search engines like Google and Bing are greatly aided by having that content made available to them as structured data. As the name suggests, structured data is content provided in a very specific format that the consumers of this content explicitly understand.
In the early days of the web the ability of search engines to identify the precise facts expressed on a web page was only as good as their ability to parse unstructured content and transform it into consistently-classified information. However sophisticated the approaches employed, search engines were still ultimately guessing about the data they found on web pages: three star icons encountered next to a restaurant review probably meant that the critic’s rating was three, and that the rating was probably out of five, but there was no way that a search engine could know this for sure.
In an effort to reduce such ambiguity, Google in the late 2000s began to support HTML-based structured data markup that allowed web publishers to provide very precise information about certain types of things, like reviews. Using the standardized vocabulary of microformats, a community initiative launched in 2005, or of data-vocabulary.org, a 2009 Google project, a web publisher could now declare that the star icons on a page represented a rating, and that the rating had a value of exactly three out of a possible score of exactly five.
While these efforts allowed Google to produce the web’s first “rich snippets”, by which things like review scores could be displayed next to a page’s snippet directly in the search results, effective structured data use was hampered both by the lack of a single standard for Google, and the absence of any structured data standards whatsoever across search engines.
The example accompanying Google’s announcement of recipe rich snippets in April 2010. At this time the snippet was generated based on the microformat hRecipe, or from the Recipe item type at data-vocabulary.org, a predecessor of schema.org.
In June 2011 Google, Bing and Yahoo! addressed the gap in structured data markup standards head-on by jointly announcing the availability of schema.org, a common set of standards that, in Google’s words, “aims to be a one stop resource for webmasters looking to add markup to their pages to help search engines better understand their websites.” Russia’s largest search engine, Yandex, signed on to the initiative later in the year.
These standards provided both a common set of terms for publishers to describe things present in their web content (vocabulary), and approved methods of encoding this information for search engine consumption (syntax).
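To make the vocabulary and syntax distinction concrete, here is a minimal Python sketch that expresses the three-out-of-five restaurant rating from earlier. The vocabulary is the set of schema.org terms used (Review, Rating, ratingValue, bestRating); the syntax shown is JSON-LD, one of the encodings schema.org can be written in alongside Microdata and RDFa. The restaurant name, the author name and the helper function `to_json_ld_script` are assumptions made up for illustration.

```python
import json

# Vocabulary: schema.org types and properties say *what* is being described.
review = {
    "@context": "https://schema.org",
    "@type": "Review",
    "itemReviewed": {"@type": "Restaurant", "name": "Example Bistro"},  # hypothetical
    "author": {"@type": "Person", "name": "A. Critic"},  # hypothetical
    "reviewRating": {
        "@type": "Rating",
        "ratingValue": 3,  # exactly three ...
        "bestRating": 5,   # ... out of a possible score of exactly five
    },
}


def to_json_ld_script(data: dict) -> str:
    """Syntax: wrap the data in a JSON-LD script element for embedding in a page."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


markup = to_json_ld_script(review)
print(markup)
```

The same vocabulary terms could be attached to the visible star icons with Microdata or RDFa attributes instead; only the encoding would change, not the facts declared.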
This degree of cooperation between search engines is rare (the only notable prior example being a 2006 agreement between Google, Microsoft and Yahoo! on a protocol for XML sitemaps), and the impact of their collaboration around schema.org on the adoption and utility of structured data cannot be overstated.
More so than previous efforts, schema.org is a living standard, and it has become more expressive over time to satisfy the demands of well-articulated use cases. Launched with just under 300 types (the things the vocabulary allows publishers to describe, like events or products or videos), schema.org today boasts more than 1,100 types.
The most commonly-used schema.org types eligible for Google rich results
Source: Class-Specific Subsets of the Schema.org Data contained in the November 2018 Web Data Commons Corpus
While the schema.org Steering Committee, whose membership is still search engine-heavy, has ultimate control over which new vocabulary is added to the schemas, non-search engine participation in the Steering Committee is now formalized, and there are multiple avenues for interested parties to participate in vocabulary development.
Search engine alignment also makes it more likely that web publishers will go to the trouble and expense of providing structured data markup, both because the benefits of doing so are not restricted to a single search engine’s results pages, and because the search engines’ commitment makes it more likely that these benefits will be enduring rather than fleeting.
And those benefits, both for end users and web publishers, are not insubstantial.
Superior visibility in the search results
Rich results are visually distinct search results that prominently display important schema.org-encoded values for specific types of content. For example, a recipe rich result might display the recipe’s ingredients, preparation time, calories and review ratings.
A recipe rich result in Google, with data sources illustrated
For eligible result types this can lead to a substantially better experience for search engine users, especially when those users are searching on a mobile device. The presence of rich results makes it much easier for a user to assess the potential usefulness of a web page without having to visit it. For example, a user might be able to avoid clicks on pages for past events in a results page of events listings, or to skim recipes to focus on those under a certain preparation time, or to explore products only within a certain price range.
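Filtering of this kind is only possible because the values are machine-readable. As a rough sketch, the Python below filters recipes by preparation time, assuming the records came from schema.org/Recipe markup in which prepTime is an ISO 8601 duration. The recipe records and the helper names `prep_minutes` and `recipes_within` are hypothetical, and the parser handles only hour and minute components.

```python
import re

# Hypothetical records, as a consumer might hold them after reading
# schema.org/Recipe markup. prepTime values are ISO 8601 durations.
recipes = [
    {"@type": "Recipe", "name": "Weeknight Pasta", "prepTime": "PT20M"},
    {"@type": "Recipe", "name": "Slow Braise", "prepTime": "PT1H30M"},
    {"@type": "Recipe", "name": "Quick Salad", "prepTime": "PT10M"},
]

# Simplified pattern: optional hours, optional minutes, nothing else.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def prep_minutes(duration: str) -> int:
    """Convert a simple ISO 8601 duration (hours and minutes only) to minutes."""
    match = _DURATION.match(duration)
    if not match or duration == "PT":
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


def recipes_within(items, limit_minutes):
    """Return the names of recipes whose prepTime is at most limit_minutes."""
    return [r["name"] for r in items if prep_minutes(r["prepTime"]) <= limit_minutes]


quick = recipes_within(recipes, 30)
print(quick)
```

With unstructured text such as "ready in about half an hour", the same filter would have to guess at the number; with a declared duration it is a plain comparison.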
This benefit is extended to web publishers as well, insofar as those search engine users are more likely to be consuming only relevant content from a publisher, allowing them to avoid the aggravation of purposeless visits (and the negative association with the brand in question). And in situations where one publisher’s offerings are similar to another’s, both the visually distinct result and the information provided within it make it more likely that a search user will click on a rich result than on the plain “blue link” of a web page that lacks structured data markup.
Depending on the search engine, structured data might be required not only to generate rich results, but also to include those same pages in search verticals that are a subset of the full search results. For example, job posting markup fuels the display of a block of job rich results in Google, with a “more jobs” link that clicks through to a set of result pages that display job postings exclusively: a job posting from a site without schema.org/JobPosting markup might appear in the web results for a relevant query, but decidedly won’t appear on the screen after a user clicks on “more jobs”.
Web pages with schema.org/JobPosting markup are eligible for rich results in Google (left), and appear in a dedicated job search vertical when a search user clicks through on “more jobs”
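For a sense of what such markup involves, the following Python sketch builds a schema.org/JobPosting description and checks it against a short list of properties before publishing. The job details and the `EXPECTED` checklist are assumptions chosen for illustration; the authoritative list of required and recommended properties is whatever the target search engine's documentation currently specifies.

```python
import json

# A hypothetical job posting described with schema.org/JobPosting terms.
job_posting = {
    "@context": "https://schema.org",
    "@type": "JobPosting",
    "title": "Technical Writer",
    "description": "Write and maintain product documentation.",
    "datePosted": "2019-01-15",
    "hiringOrganization": {"@type": "Organization", "name": "Example Corp"},
    "jobLocation": {
        "@type": "Place",
        "address": {
            "@type": "PostalAddress",
            "addressLocality": "Madison",
            "addressRegion": "WI",
            "addressCountry": "US",
        },
    },
}

# Assumed checklist for illustration only; verify against current documentation.
EXPECTED = ("title", "description", "datePosted", "hiringOrganization", "jobLocation")


def missing_properties(item, expected=EXPECTED):
    """List the expected properties that are absent or empty in the item."""
    return [name for name in expected if not item.get(name)]


print(missing_properties(job_posting))
print(json.dumps(job_posting, indent=2))
```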
Increasingly, schema.org-provided data is also being used as a mechanism to enrich the search engines’ knowledge graphs. This potentially extends the utility of a piece of content from a listing in a search engine’s document index (those “ten blue links”) to a presence in that engine’s knowledge base of facts about things, with structured data-provided facts surfacing in features such as Google’s Knowledge Panels, or in search engine voice responses.
Search engine discoverability
Structured data markup makes it much easier for search engines to understand a web page, and in particular the entities and information about them to which a piece of content makes reference. While structured data use in itself does not provide web pages with a boost in search engine rankings, the search engines’ superior understanding of pages with structured data makes it more likely that these pages will appear in the results for relevant search queries.
schema.org, for example, makes it easy for publishers to provide data that allows a search engine to determine which one of two or more similar entities a piece of content refers to, like which of the four Wisconsin towns named “Springfield” an article is referring to, or which of several similar products is being described on a given product detail page.
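One way a publisher might supply that disambiguating data is sketched below in Python: the article's subject is described as a Place with a containing area and a sameAs identifier. The headline, the county names and the example.org URLs are placeholders invented for illustration, and `same_entity` is a hypothetical helper showing how a consumer could compare identifiers rather than names.

```python
import json

# Hypothetical article markup; county and URL values are placeholders.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Town board approves new park",
    "about": {
        "@type": "Place",
        "name": "Springfield",
        "containedInPlace": {
            "@type": "AdministrativeArea",
            "name": "Example County A, Wisconsin",
        },
        "sameAs": "https://example.org/places/springfield-county-a-wi",
    },
}

# A second place with the same name but a different identifier.
other_springfield = {
    "@type": "Place",
    "name": "Springfield",
    "sameAs": "https://example.org/places/springfield-county-b-wi",
}


def same_entity(a: dict, b: dict) -> bool:
    """Treat two descriptions as one entity only if their sameAs identifiers match."""
    return a.get("sameAs") is not None and a.get("sameAs") == b.get("sameAs")


print(same_entity(article["about"], other_springfield))
print(json.dumps(article, indent=2))
```

The names alone are identical, so a name comparison would conflate the two towns; the identifier comparison keeps them apart.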
Search engines are increasingly leveraging structured data as a means of discovery for content that cannot otherwise readily be parsed to uncover meaning. For example, Google is using schema.org/Dataset markup to ingest publisher-provided information about datasets whose meaning would otherwise be opaque to the search engines, and provide searchers with results about the content of these datasets.
Similarly, structured data allows publishers to provide information that can be used by search engines to generate interactive experiences within search results. A search result for a song, for example, might include structured data-powered links that launch that song directly in a streaming service.
Search engine transformation
The power of structured data to fuel things like dataset discovery and media actions is largely predicated on the separation of a web page’s presentation layer (what a user sees) from its data layer (what a search engine consumes).
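The separation can be illustrated with a small Python sketch that reads only the data layer of a page: it collects the contents of JSON-LD script elements and ignores the visible markup entirely. The sample page and the class name `JsonLdExtractor` are assumptions for illustration, and real crawlers are considerably more forgiving of malformed input than this.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of application/ld+json script elements."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


# Hypothetical page: the body is the presentation layer, the script the data layer.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Weeknight Pasta", "prepTime": "PT20M"}
</script>
</head><body><h1>Weeknight Pasta</h1><p>Ready in about twenty minutes.</p></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(page)
extractor.close()
print(extractor.items)
```

The page design could change completely without affecting what this consumer reads, which is the property that makes reuse on other endpoints practical.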
The provision of this data layer is a de facto method of generating intelligent content - content that is “structurally rich and semantically aware” - which in turn allows this content to be reconfigured and reused for additional publishing endpoints.
Search engines are now starting to transform web page-provided content to other endpoints to surface this content outside of web search results. For example, a recipe published using Google-prescribed schema.org/Recipe markup is automatically eligible to be returned as audio on a Google Home. Similarly, schema.org’s speakable property allows publishers to identify content that’s “especially appropriate for text-to-speech conversion”, and so for search engines to preferentially return this content in response to a voice search.
Web documents annotated with schema.org structured data may be returned as search results in numerous Google endpoints aside from their standard “10 blue links”
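As a sketch of how the speakable property mentioned above might be declared, the Python below describes a web page whose headline and summary are flagged for text-to-speech. The page name and the CSS selectors are hypothetical; whether a given search engine acts on the markup depends on its own eligibility rules.

```python
import json

# Hypothetical page description; selectors point at elements assumed to exist.
speakable_page = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "name": "Structured data basics",
    "speakable": {
        "@type": "SpeakableSpecification",
        "cssSelector": [".headline", ".summary"],
    },
}

speakable_json = json.dumps(speakable_page, indent=2)
print(speakable_json)
```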
Just as schema.org has supported improved user experiences on mobile devices by reducing a searcher’s need to browse websites, so it now supports the easy ingestion of web-provided information on smart speakers and smart displays, giving that content at least a fighting chance to appear in future search results, and on future devices.
Providing structured data may improve the discoverability of content by the search engines and result in higher visibility in search results, as well as potentially making web page-based content available on other devices.
All of the parties involved in the production and use of schema.org are potential beneficiaries of structured data. Search engines benefit by better understanding the content they’re indexing, and by being able to build new search products based on prescribed schema use. Publishers benefit by the additional exposure their content receives in the search results, both in terms of visibility and relevancy, and by the potential of that content to now reach more users on a greater variety of devices.
Whether from scratch, by using generators or with the help of developers, adding structured data to content should be on the agenda of any publisher interested in content discovery.