Google, Bing, Yahoo and your metadata
Inigo San Gil (MCM), Stéphane Corlosquet (Acquia) and Adam Shepherd (ESIP - WHOI)
After years of suspense, the wait is over: the big three search engines have chosen a standard (also known as a specification) that gives information contributors better mechanisms for describing information resources. With it, the search engines improve the classification and sorting of information, resulting in a better experience when searching content on the web. When we say "content", we include datasets. This is why data keepers should pay attention to these advances by the main search engines, and why we wrote this brief article.
The specifications chosen by Google, Bing, Yahoo and Yandex reside at schema.org. The initiative was announced in June 2011, followed by workshops and early adopters (such as the White House). The first author of this article became aware of the Schema.org initiative during the last IM/ESIP meeting (ESIP, 2014). Here we expand on the Schema.org related topics covered at the ESIP Schema.org hack-a-thon session (Fils and Shepherd, 2014).
This article offers a light overview of the Dataset specification at schema.org, a practical way to catch up with it, and a motivation: why would LTER comply with yet another metadata specification? The main merit of adopting Schema.org is to mitigate failures in data discovery when the data seeker uses the main internet search engines.
Discovery: searching and finding.
Data discovery is a hot topic for all LTER sites. What is the purpose behind information discovery science? Simply put, it is to offer web visitors the best experience when searching for information resources, including data and research projects. A typical US-LTER site manages and serves thousands of unique information resource units, including hundreds of scientific datasets. In this article, we refer to a dataset loosely as a group of contextualized scientific measurables. Structuring these information resources and making them easy to sort through is not a trivial task. The internet search engines are continuously improving their solutions to a similar but much larger problem: sorting out information. In recent years, the race to offer the most relevant contextualized content has intensified. Specialized search results have been driven in part by the explosion of mobile devices -- download speeds and viewport constraints forced search engines to be even more precise as, paradoxically, internet speed, processor speed and screen size decreased on the most popular devices used to access the Internet. Dataset managers can take advantage of these advances in search science. The specialized dataset annotations presented by schema.org represent one of the most relevant advances for scientific information managers.
The Internet of Things1 will bring you the results you are looking for by virtue of better indexing, cataloging and exposure to easily connectable services for data and information. You may count on the big search engines to make this happen; after all, those companies have all the ingredients to perfect the art and science of information management.
For a year or more now, you have surely noticed the mobile-friendly rich snippets that Google offers as the result of a search. Locations, corporations and all sorts of things you look for now appear in a brief vignette with relevant information, which may include a title, a short description, geolocation and temporal relevance. Depending on the information sought, the highlighted bits vary. We experimented with these new features: we searched for "Lake Hoare" using Google, and repeated the search using Bing and Yahoo with similar results. Here we present the Google results. At the time of writing, the Lake Hoare search showed a rich snippet on the right side of a Chrome browser on a Windows desktop computer.
The results offer a photo, a map and a brief summary of relevant data such as surface elevation, area, length, width, mean depth and the inflow sources. There is also attribution information (credit to the sources).
Similarly, the same Google search using Safari on a 64 GB iPod touch 5G yields an even briefer rich snippet, which includes the photo, a mini-map and two informational tokens (surface elevation and area), plus a "Read more" tab. It is also worth mentioning that the classic results come after this rich snippet.
Likewise, a dataset search on Google may, if the dataset is annotated appropriately, yield a similar snippet: perhaps a representative graph and a few key information placeholders, such as the where (a map) and the when, plus an attribution link.
What is Schema.org?
Schema.org describes itself as follows: "[Schema.org] provides a collection of schemas that webmasters can use to markup HTML pages in ways recognized by major search providers, and that can also be used for structured data interoperability (e.g. in JSON). Search engines including Bing, Google, Yahoo! and Yandex rely on this markup to improve the display of search results, making it easier for people to find the right Web pages." The charge is to provide a unified structure to accommodate descriptions of all things, and to use them for better discoverability. The interoperability dimension will not be discussed here.
Google, Yahoo and Microsoft's choice of vehicle to annotate datasets is not EML, the Biological Data Profile, the Dublin Core, the Darwin Core or any of the 1911* family of standards promoted by the International Organization for Standardization (ISO). The good news is that the Schema.org concepts and implementations are synergistic with those adopted by LTER over a decade ago; however, the new conventions and their particular technical implementations are different enough that you may have to deal with a fresh adoption process. Structurally, the markup is not encoded in XML, but rather in simple HTML attributes with proper namespaces. The overlap of defined elements with LTER's usage of EML is far from 100%. Details about the concept overlap, and a complete discussion of the mappings between the currently used standards and Schema.org, are beyond the scope of this article. However, the authors encourage adopters to review our current implementation and report mismatches and possible new mappings.
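To make the idea of "simple HTML attributes" concrete, here is a minimal Python sketch that wraps two dataset values in Schema.org attributes. It assumes the RDFa Lite syntax (vocab, typeof, property); microdata (itemscope, itemtype, itemprop) is another syntax that could be used. The function name and the sample values are hypothetical and do not come from any LTER system.

```python
from html import escape


def render_dataset(name, description):
    """Return an HTML fragment annotated with Schema.org attributes.

    Assumption: RDFa Lite syntax (vocab / typeof / property).
    """
    return (
        '<div vocab="http://schema.org/" typeof="Dataset">\n'
        f'  <h1 property="name">{escape(name)}</h1>\n'
        f'  <p property="description">{escape(description)}</p>\n'
        '</div>'
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(render_dataset("Lake levels", "Daily lake level measurements."))
```

The visible page stays the same for a human reader; only the extra attributes tell a search engine robot which piece of text is the name and which is the description.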
Schema.org technical details
Schema.org conforms to a hierarchical structure. At the top of the hierarchy there are two elements and concepts: Thing and DataType. The most relevant element for the purpose of this article is Dataset, which sits under CreativeWork (note: do not get trapped by the term's semantic connotations; for now, please see it mainly as a plausible placement in the hierarchy). The element CreativeWork has the following differentiated information placeholders branching out:
about, accessibilityAPI, accessibilityControl, accessibilityFeature, accessibilityHazard, accountablePerson, aggregateRating, alternativeHeadline, associatedMedia, audience, audio, author, award, awards, citation, comment, commentCount, contentLocation, contentRating, contributor, copyrightHolder, copyrightYear, creator, dateCreated, dateModified, datePublished, discussionUrl, editor, educationalAlignment, educationalUse, encoding, encodings, genre, hasPart, headline, inLanguage, interactionCount, interactivityType, isBasedOnUrl, isFamilyFriendly, isPartOf, keywords, learningResourceType, mentions, offers, provider, publisher, publishingPrinciples, review, reviews, sourceOrganization, text, thumbnailUrl, timeRequired, typicalAgeRange, version, video.
The placeholders above contain familiar items, such as creator, publisher, datePublished and keywords. There are also a few that may match placeholders we have been using to describe data sets, for example alternativeHeadline, citation, version, inLanguage and contentLocation. Many of these terms are also parents of other terms in this hierarchy of things, such as "creator", which is of the type "Person".
Many of the CreativeWork concepts that seem to be encapsulated in placeholders such as hasPart, isPartOf, mentions, comment, typicalAgeRange, interactionCount, contentRating and interactivityType are not considered in our EML schema descriptions. Perhaps some of them deserve a second look, especially those that indicate relationships, such as isPartOf and hasPart. Relations between datasets and information are one of the weakest parts of EML. The EML relational potential exists, but in practice it never translates into discoverability through relationships. For example, our network lumping techniques resulted in overly complex hierarchies.
Finally, the Schema.org Dataset adds a few dataset-specific properties to the properties inherited from the CreativeWork parent category:
- catalog - A data catalog which contains a dataset
- distribution - A URL that points to the data resource
- spatial - The geo properties of the data
- temporal - A date range that characterizes the data-set
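The inheritance described above can be sketched in a few lines of Python. This is only an illustrative, partial model: the CreativeWork list is a small subset of the placeholders listed earlier, and the three Thing properties (name, description, url) are taken from schema.org rather than from this article.

```python
# Partial, hand-copied model of the Schema.org type hierarchy.
PARENT = {"Dataset": "CreativeWork", "CreativeWork": "Thing", "Thing": None}

OWN_PROPERTIES = {
    # Subset of Thing's properties (from schema.org, not from this article).
    "Thing": ["name", "description", "url"],
    # Subset of the CreativeWork placeholders listed in the article.
    "CreativeWork": ["about", "creator", "publisher", "datePublished", "keywords"],
    # The four dataset-specific properties listed in the article.
    "Dataset": ["catalog", "distribution", "spatial", "temporal"],
}


def all_properties(type_name):
    """Collect the properties of a type plus everything it inherits."""
    props = []
    while type_name is not None:
        props = OWN_PROPERTIES[type_name] + props  # ancestors come first
        type_name = PARENT[type_name]
    return props


if __name__ == "__main__":
    print(all_properties("Dataset"))
```

Calling all_properties("Dataset") walks up through CreativeWork to Thing, which is why a Dataset page can carry keywords or a name even though those properties are defined higher in the hierarchy.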
Schema.org and DEIMS: An implementation case.
One of the advantages of adopting DEIMS is that you ride along with a can-do community of developers. Our colleagues at ESIP were already developing an extension to accommodate the mappings from our database to the HTML rendering of schema.org categories. The latest version of DEIMS is schema.org compliant out of the box.
What if you are using Drupal, but not DEIMS? At the time of writing, some LTER sites (e.g., Virginia Coast Reserve, Coweeta, California Current and Palmer LTERs) use Drupal in their hybrid information management systems. Perhaps these sites can still leverage the Drupal work. I will describe what we did for DEIMS so perhaps you can re-use some of these steps. First, we installed the Drupal Schema.org module. The install process involved issuing 2 commands on the DEIMS server's bash shell:
% drush dl schemaorg
% drush en schemaorg schemaorg_ui
You can do this first step the traditional Drupal way -- installing a module like any other Drupal contrib module without using drush.
The next step involves configuring the newly installed schemaorg module. This step can be broken down into two general sub-steps.
The first configuration sub-step: Using your favorite browser, visit the edit dataset content type (the URL tail will look like admin/structure/types/manage/data_set). In the vertical menu, locate the schemaorg settings, and using the autocomplete associate the DEIMS dataset content type with schemaorg's 'Dataset' type. See figure below.
The second configuration step is to map all the mappable fields from the DEIMS Dataset content type to Schema.org's Dataset specification placeholders. For example, the "Abstract" maps to Schema.org's "About". For our mapping, we clicked on the manage fields tab for the Dataset content type, and looked for the "Abstract" field in the new form. We clicked on edit for that abstract field, and then at the bottom of the abstract field configuration form, we found a schema.org mapping autocomplete text field, which we used to find "About" and map the "Abstract". See figure below.
We repeated this DEIMS-field to Schema.org Dataset property mapping process for all the fields we thought were mappable. That was all there was to it: the rendered HTML produces the markup that optimizes the job of big search engine robots.
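Conceptually, the result of this configuration is a lookup table from DEIMS field labels to Schema.org properties. The following Python sketch shows that idea outside Drupal; it is not how the schemaorg module stores its mappings. Only the "Abstract" to "about" pair comes from the steps above; the other pairs and the function name are hypothetical.

```python
# DEIMS field label -> Schema.org Dataset property.
# Only "Abstract" -> "about" comes from the article; the remaining pairs
# are hypothetical examples of what a site might choose.
FIELD_MAP = {
    "Abstract": "about",
    "Title": "name",                      # hypothetical
    "Publication date": "datePublished",  # hypothetical
    "Keywords": "keywords",               # hypothetical
}


def to_schema_org(record):
    """Split a record into mapped properties and unmapped field labels."""
    mapped, unmapped = {}, []
    for label, value in record.items():
        prop = FIELD_MAP.get(label)
        if prop is None:
            unmapped.append(label)  # no Schema.org counterpart chosen yet
        else:
            mapped[prop] = value
    return mapped, unmapped


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    print(to_schema_org({"Abstract": "Ice thickness records.", "Methods": "..."}))
```

Keeping the list of unmapped labels is a convenient way to see which fields still need a decision.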
If you are using XSLT to render your dataset pages as HTML, I would suggest you add XSLT directives that produce Schema.org-compliant attributes in the resulting HTML. The process is undoubtedly more laborious, but not too daunting. Rendering content as HTML attributes is one thing XSLT does reasonably well.
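The same kind of transformation can be sketched in Python with the standard library as a stand-in for an XSLT stylesheet (this is not XSLT itself, and not the authors' implementation). The source record below is a hypothetical, heavily simplified XML fragment rather than a valid EML document, and the "title" to "name" rule is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified metadata record (not valid EML).
SOURCE = """<dataset>
  <title>Example data set</title>
  <abstract>Short summary of the data.</abstract>
</dataset>"""

# Source element -> (HTML tag, Schema.org property).
# "abstract" -> "about" follows the DEIMS mapping in the article;
# "title" -> "name" is an assumption.
RULES = {"title": ("h1", "name"), "abstract": ("p", "about")}


def transform(xml_text):
    """Build an HTML fragment whose elements carry Schema.org attributes."""
    src = ET.fromstring(xml_text)
    div = ET.Element("div", {"vocab": "http://schema.org/", "typeof": "Dataset"})
    for child in src:
        if child.tag in RULES:
            tag, prop = RULES[child.tag]
            out = ET.SubElement(div, tag, {"property": prop})
            out.text = child.text
    return ET.tostring(div, encoding="unicode")


if __name__ == "__main__":
    print(transform(SOURCE))
```

In a stylesheet, each entry in RULES would correspond to a template that emits the HTML element with the extra attribute.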
Testing Schema.org implementation
Once you have worked your way to rich snippet compliance, it may be time to see your results. Google provides a testing tool to examine whether your markup is producing the results as advertised: http://www.google.com/webmasters/tools/richsnippets
The tool lets you enter a URL and returns a condensed report of what Google is able to parse out.
The landing page of the tool also loads the results, which seems logical and less entropic. Let's test McMurdo's Glaciers data!
The result of the test is good. The data set title is highlighted, just as expected. How about the rest of the metadata? Many attributes were parsed correctly, and it even does well with dates, the bête noire of metadata. Here is a screenshot of some of the attributes, as parsed by Google:
There is some work to do, but mostly the test gives satisfactory results. You may also wonder how many days it took us to implement this. The answer would not be of much guidance to many: I just applied the schemaorg module to DEIMS; hooray for Stéphane Corlosquet and others from the Drupal community.
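Before submitting a page to the testing tool, a quick local check can list which properties a parser would find in the markup. The Python sketch below is a rough approximation under stated assumptions: it only looks at RDFa-style property attributes, it is not how Google parses pages, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (property, text) pairs from `property` attributes.

    Simplifications: any closing tag ends the current property, so text
    that follows a nested element is ignored, and values stored in
    attributes (for example `content`) are not read.
    """

    def __init__(self):
        super().__init__()
        self.found = []
        self._current = None  # property whose text is being read

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self._current = prop
            self.found.append((prop, ""))

    def handle_data(self, data):
        if self._current:
            prop, text = self.found[-1]
            self.found[-1] = (prop, text + data)

    def handle_endtag(self, tag):
        self._current = None


def extract_properties(html_text):
    """Return the annotated properties of a page as (property, text) pairs."""
    parser = PropertyCollector()
    parser.feed(html_text)
    parser.close()
    return [(prop, text.strip()) for prop, text in parser.found]
```

Feeding the rendered HTML of a dataset page to extract_properties gives a list that can be compared against the fields you expected to map.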
Discussions: It is Google, Microsoft, Yahoo! and Yandex.
Before concluding the article, we would like to encourage you to make an effort to mark up your datasets according to the rules set forth by the big three search engines and Yandex. Many scientists will not use Google to search for a dataset, but some will, and undoubtedly many internet users will bump into rich snippets featuring your site's datasets once these have been properly marked up. Another good reason to adopt these rules is the derived tools that will likely be built to work with the Google, Microsoft and Yahoo specifications. We will want to leverage some of these tools, and chances are they will be developed at a faster pace than those supported by small, poorly funded initiatives.
Adoption of Schema.org may boost data discoverability, but it is also about taking advantage of the potential that schema.org and the companies behind it offer, and having the opportunity to help shape what interoperability might be in the era of the Internet of Things.
At the time of writing we could not offer a comparative analysis of the user experience, as the new implementation has not yet gathered enough usage statistics to support a discussion of possible improvements.
1 The internet of things is expected to offer advanced connectivity of devices, systems, and services that goes beyond machine-to-machine communications (M2M) and covers a variety of protocols, domains, and applications. [Wikipedia - the internet of things]
Fils, Doug and Shepherd, Adam. Schema.org Hack-A-Thon. ESIP 2014 Summer meeting, Frisco, CO. Resource at: http://commons.esipfed.org/node/2557
Schema.org. Here is the thing. https://schema.org/Thing
The internet of things. About 43,300,000 results as of Nov. 2014. Google.
Web Schemas http://www.w3.org/wiki/WebSchemas
Corlosquet, Stéphane. 2011. Drupal contrib module schemaorg. http://www.drupal.org/project/schemaorg
Barker, Phil and Campbell, Lorna M. The Learning Resource Metadata Initiative, describing learning resources with schema.org, and more? Nov, 2014. Webinar. Resource at http://bit.ly/1pKiCUj