Google Base, RDF, the Semantic Web, and a lot of ferment...

[ cross-posted on VoIT ]

It’s well known that Google is making moves on a massive scale, and with some of its initiatives it’s hard even to keep up…

That is the case with this new service, available since November 15 in English only and much discussed on the Web, both to work out what on earth it actually is and why anyone should use it, and because it looks like Google’s first step toward the world of metadata, the Semantic Web and RDF…
-> Google Base

What Base is, according to the Web

Let’s try to understand what it is supposed to be with a roundup of the announcements around the Web, starting with Google’s own blog:
-> First Base

Today we’re excited to announce Google Base, an extension of our existing content collection efforts like web crawl, Google Sitemaps, Google Print and Google Video.
Google Base enables content owners to easily make their information searchable online. Anyone, from large companies to website owners and individuals, can use it to submit their content in the form of data items. We’ll host the items and make them searchable for free.

And further:

Right now, there are two ways to submit data items to Google Base. Individuals and small website owners can use an interactive user interface; larger organizations and sites can use the bulk uploads option to send us content using standard XML formats.
Rather than impose specific schemas and structures on the world, Google Base suggests attributes and item types based on popularity, which you can use to define and attach your own labels and attributes to each data item. Then searchers can find information more quickly and effectively by using these labels and attributes to refine their queries on the experimental version of Google Base search.

And finally:

This beta version of Google Base is another small step toward our goal, creating an online database of easily searchable, structured information.

The last part in particular, “structured information”, is intriguing to say the least: think of the world of RDF and microformats… Apparently Google wants to throw itself into the same venture too, but using only XML…

For completeness, let’s also look at Google Base’s About page:

Google Base is a place where you can easily submit all types of online and offline content that we’ll host and make searchable online.
You can describe any item you post with attributes, which will help people find it when they search Google Base.
In fact, based on the relevance of your items, they may also be included in the main Google search index and other Google products like Froogle, Google Base and Google Local.

And further:

How it’s different: Google Base enables you to add attributes that better describe your content so that users can easily find it.
The more popular specific attributes become, the more often we’ll suggest them when others post the same items. Similarly, items that become more popular will show up as suggested item types in the Choose an existing item type drop down menu.
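The popularity-based suggestion described in the quote can be pictured with a small sketch. This is only a guess at the general idea, not Google’s actual mechanism: the `suggest_attributes` function, the item layout and the sample data are all made up for illustration.

```python
from collections import Counter


def suggest_attributes(items, item_type, top_n=3):
    """Return the attribute names used most often on items of `item_type`."""
    counts = Counter()
    for item in items:
        if item["type"] == item_type:
            # Count each attribute name once per item that uses it.
            counts.update(item["attributes"].keys())
    return [name for name, _ in counts.most_common(top_n)]


# Hypothetical data items, each with a type and free-form attributes.
items = [
    {"type": "recipe", "attributes": {"cuisine": "italian", "servings": 4}},
    {"type": "recipe", "attributes": {"cuisine": "thai", "cook_time": "20 min"}},
    {"type": "job", "attributes": {"salary": 30000, "location": "Venice"}},
    {"type": "recipe", "attributes": {"cuisine": "french", "servings": 2}},
]

print(suggest_attributes(items, "recipe", top_n=2))  # ['cuisine', 'servings']
```

The point is simply that no schema is imposed up front: the "schema" emerges from whatever attribute names people use most.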

And here are a few of the first FAQs:

Why should I use Google Base?
If you have information you want to share with others, but aren’t sure how to go about gaining an audience, Google Base is for you.
If you don’t have your own website, we’ll host your content for you. You’ll be able to choose labels and attributes that can draw more attention to the content you’re showing. And, based on their relevance, your items may appear on Google, Froogle, or Google Local.

From seogoogle.it, a couple of interesting remarks:

Google Base is for building an audience, for bringing visitors to your content.
[…]
In short, it would seem that Google Base can be used like a perfectly ordinary website, to upload any kind of information you want to get across; indeed, it was created precisely to give content visibility.
Now I wonder: what is the best integration for someone who already has a site and wants to exploit the potential of Google Base?
Google has always favored those who immediately embraced and adopted its new offerings (as with RSS feeds a while back), and in my opinion using Google Base right away could be an important advantage. But I can’t see how. Duplicate the content here as well? Integrate the publication of RSS feeds directly into Google Base?

From my point of view, the crux is that it is a new, global way of adding metadata to resources that already exist, such as RSS feeds…
So once the information has been added to one’s feeds, it remains to be seen whether it will also end up in Base, either via an API or automatically through Google’s crawler…

In the meantime people are running tests; this, for example, is one item that has been submitted:

-> Darren Rowse

I recommend reading these two early reports:

-> Google Base, from Problogger.net

-> My Experience With Google Base

Links with RDF and the Semantic Web

Danny Ayers is, as usual, a step ahead of everyone, and he was the first to try the system from a semantic point of view as well; here are his two posts on the matter:

I’d like to quote a few particularly significant points from his posts, which as always should be read in full…

There doesn’t appear to be any direct way of accessing the (meta)data submitted - this is crying out for an open API.
[…]
I’ve got mixed feelings about the system. On the positive side, their use of RDF-like data structuring shows a lot of promise, and there’s implicit acknowledgement that text-mining based search is not the answer to every data access requirement.
[…]
The mere existence of Google Base may help encourage developers to take the (Semantic) Web of Data idea a bit more seriously (though what I saw was still very document-oriented).
The growth of folksonomies has already led a lot of people into the space between free-text indexing and rigid taxonomies, and it’s clear that when you use tech like RDF the two extremes are mutually exclusive - you can exploit the good points of both.
Google Base may be a few decades behind what can be done with Description Logics (such as RDF/OWL), but at least it’s a move away from the confines of hierarchies (XML/Gopher) and fixed record-oriented systems (SQL DBs) and towards a more flexible kind of relational approach.
Google already make quite a bit of URIs with LinkRank, I imagine this system will go further, though probably not quite so far as their significance on the Semantic Web.
[…]
It’s worth noting that by building a centralised system, a lot of these problems are bypassed. But it’s the distributed nature of the Web that gives it the power. Distributed control and ownership of the information is what gives us the power.

Some of these points are very important and are taken up again later in this post; I’d like to draw attention to the last sentences quoted:

it’s the distributed nature of the Web that gives it the power. Distributed control and ownership of the information is what gives us the power

Now let’s see what Shelley Powers, who always shows great imagination in the titles of her posts, has to say about this new service:

-> The Mountain

Leaving aside the matter of the mountain and Mohammad, I’d like to take a few excerpts from what she says:

I agree with Danny Ayers in that Google Base is a step forward in the effort to get folks to think about how to annotate their material online…
My biggest concern about this service is the centralized, proprietary nature of this type of data store.
Right now, I have simple-to-use plugins installed on my weblog tool that automatically generate very rich metadata formatted as RDF/XML, available for all. If you use Piggy Bank or some other tool that can consume RDF, or any tool that can work with XML, you have access to this data. It can be easily and unambiguously combined with other data from the same or other sources, and queried using the SPARQL query language.
The ‘owner’ of the data is the originator of the data and whatever gatekeeping happens to the data is nodular and thus easily routed around.
In other words, the data is very web-like: structured, distributed, linked (discoverable), and malleable.
Google Base is interesting, but it isn’t web-like.
Its architecture is contrary to Google’s own success, too, because the company’s processes have always gone to the source, rather than have the source go to it; earlier efforts that reversed this, such as the original Yahoo, have not been as successful.
With Base, Google has forgotten who is Mohammad and who is the Mountain.

The bold emphasis is mine, as usual. The question here is really of considerable interest: Shelley Powers has always been one of the most pragmatic minds when it comes to RDF and the Semantic Web, and the work she is doing with WordPress is remarkable; I am using it myself…

I agree with what she says, and I think her analysis is worth keeping in mind when looking at what Google is up to: the centralization of the service is in fact undermining its Web nature and concentrating its power…

Moreover, the key point of the discussion in the comments following the post is the centralization of the data Google collects: the data is locked inside Base, and, as Danny Ayers rightly says, it would be a different matter if an open query service were added as well, perhaps using SPARQL…

The example of her site [ shelley powers site ], with its RDF material that a client can easily aggregate and then query, is a very pragmatic demonstration of what can be done with RDF without the complexity of OWL and of the Semantic Web in general: in my view only a SPARQL query endpoint is missing, and then the picture is complete… [ one is not yet available for the RAP environment she uses ]

Staying on the subject of RDF and Base, let’s look at some details of the service…

I’m quite intrigued by the extension Google has devised for RSS1, that is, for RDF-compliant feeds…

-> RSS extensions by Google Base

Google has defined a new namespace (http://base.google.com/ns/1.0) to support these attributes. Are we seeing the first formal adoption of Semantic Web concepts (by Google) here?
Google Base lets users create new schemas (attributes). For instance, an example from Google Base shows how “language_skills” attribute can be added to a job opening description. I wonder how these new namespaces are ingested by Google Base?

I fully share the author’s thoughts, since adding your own namespace is the right technique for adding modules to the world of RSS1, and therefore of RDF…
At this point it is both interesting and only right to take a look at the page where the additions are explained:
-> Section 1: Extending RSS 1.0
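To make the "module via namespace" idea concrete, here is a minimal sketch that builds a single RSS 1.0 item carrying extra elements in the Google Base namespace quoted above. The `build_item` helper, the example.org URLs and the attribute names `location` and `job_type` are illustrative assumptions; I have not checked them against Google’s schema.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"
G_NS = "http://base.google.com/ns/1.0"  # namespace URI quoted in this post

# Ask ElementTree to use readable prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("rss", RSS_NS)
ET.register_namespace("g", G_NS)


def build_item(about, title, link, g_attributes):
    """Build an RSS 1.0 <item> with extra elements in the g: namespace."""
    item = ET.Element(f"{{{RSS_NS}}}item", {f"{{{RDF_NS}}}about": about})
    ET.SubElement(item, f"{{{RSS_NS}}}title").text = title
    ET.SubElement(item, f"{{{RSS_NS}}}link").text = link
    for name, value in g_attributes.items():
        # Each extension attribute becomes a child element in the module namespace.
        ET.SubElement(item, f"{{{G_NS}}}{name}").text = str(value)
    return item


item = build_item(
    "http://example.org/jobs/1",
    "Web developer",
    "http://example.org/jobs/1",
    {"location": "Venice", "job_type": "full-time"},  # hypothetical attribute names
)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

The core RSS elements are untouched; everything Base-specific lives in its own namespace, which is exactly how RSS 1.0 modules are meant to work.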

First of all, they created an XSD schema, so something entirely XML-based, even though the context is really that of an RDF Schema…
Afraid of naming or openly using RDF? Who knows.

But the interesting points lie elsewhere:

the default attributes available for the feed’s RSS item objects are geared toward marketing and the promotion of products and people, which makes the purpose of Google Base itself clearer: greater visibility for, and a better characterization of, whatever you want to put out there…
the ability to create new attributes of your own using a new, specially dedicated namespace, where each attribute is in practice an XML type… in my opinion a very serious design limitation

As you can see, Google’s approach seems to be to introduce a service that is blatantly built on metadata, and so potentially within RDF’s range of use, and yet to use techniques that are blatantly XML-based, with all the limitations that entails…

Now, I’d like to know whether anyone more expert than I am knows of a reason why, in the world of extensions, the c namespace created for users’ custom schemas can be a single one: at the very least there will be collisions, which are likely with a single namespace and a potentially worldwide audience…

In RDF, the openness of the model itself and the proper use of namespaces resolve this dilemma: why not use a namespace that is also based on the user’s ID, which is unique?
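A minimal sketch of that suggestion, under stated assumptions: the `user_namespace` URI pattern on example.org is purely hypothetical (Google Base offers nothing of the kind), while `SHARED_NS` is the single custom-attribute namespace URI from Google’s documentation.

```python
def qualified_name(namespace_uri, local_name):
    """Expanded name in Clark notation: {namespace}local."""
    return f"{{{namespace_uri}}}{local_name}"


# The single namespace shared by every user's custom attributes.
SHARED_NS = "http://base.google.com/cns/1.0"


def user_namespace(user_id):
    # Hypothetical per-user URI pattern, for illustration only.
    return f"http://example.org/base/users/{user_id}/ns#"


# Two users independently invent an attribute called "size".
alice_shared = qualified_name(SHARED_NS, "size")
bob_shared = qualified_name(SHARED_NS, "size")
alice_own = qualified_name(user_namespace("alice"), "size")
bob_own = qualified_name(user_namespace("bob"), "size")

print(alice_shared == bob_shared)  # True: one name, possibly two meanings
print(alice_own == bob_own)        # False: each user owns a distinct name
```

With one namespace per user, Alice’s shoe size and Bob’s file size can never be mistaken for the same property.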

In reality they have introduced a variable in the namespace declaration, quite a clever trick in my view:

xmlns:[prefix]="http://base.google.com/cns/1.0"
Google Base providers have the option of creating their own prefixes. The “g:” prefix is reserved for the Google Base XML module and should not be used. Also, prefixes beginning with the three-letter sequence x, m, l, in any case combination, are not acceptable.

In practice they let users name the namespace however they like (that is, choose its prefix), while using the same URI in every case.
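A quick way to see why a free choice of prefix changes nothing is to hand two documents to a namespace-aware XML parser. This is just a sketch: the two `<item>` snippets and the prefix `shoes` are invented, and only the namespace URI comes from Google’s documentation.

```python
import xml.etree.ElementTree as ET

CNS = "http://base.google.com/cns/1.0"  # custom-attribute namespace URI

# Two providers pick different prefixes but are forced onto the same URI.
doc_a = f'<item xmlns:c="{CNS}"><c:size>42</c:size></item>'
doc_b = f'<item xmlns:shoes="{CNS}"><shoes:size>XL</shoes:size></item>'

# ElementTree expands prefixes to {namespace-URI}local-name while parsing.
tag_a = ET.fromstring(doc_a)[0].tag
tag_b = ET.fromstring(doc_b)[0].tag

print(tag_a)           # {http://base.google.com/cns/1.0}size
print(tag_a == tag_b)  # True: the prefix is gone, only the URI counts
```

To any namespace-aware consumer the prefix is a local abbreviation and nothing more, so the two `size` attributes are one and the same name.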

On a logical level I can’t see the sense of it, given that URI = NAMESPACE is also a way of deriving information about the schema itself: it is advisable to associate that URI with a URL that responds appropriately to whichever client asks for information ( it may be a software agent, but also the browser of a user looking for information on that particular schema ).

It’s worth picking up a few excerpts from Danny’s posts on this:

PS. A bit more about their data structure can be gleaned from the format descriptions they have for bulk upload (RSS 1.0, RSS 2.0, tab delimited, Atom 0.3) . The RSS 1.0 version is particularly interesting - you get a namespace in which you can invent your own simple datatyped properties. Unfortunately they’ve made a little mistake:
<c:prior_experience_years type="int">5</c:prior_experience_years>
unprefixed attributes are deprecated in RDF/XML Syntax Specification (Revised)
that’s not valid RDF/XML.

And further:

**Getting data into the system via RSS 1.0 might offer a good machine interface.** Here’s a custom attribute done that way, slightly tweaked from their example:
<c:x-prior_experience_years type="int">5</c:x-prior_experience_years>
That syntax could have originated as the RDF:
<http://example.org/itemURI> x:prior_experience_years "5"^^xsd:integer .
Assuming we have a bit of a convention of how the bits are worked out, the templates in things like FOAF Output for WordPress could be tweaked to automatically push RDF into Google Base.

In practice, he too sees that the junction between the worlds of RDF and Base is precisely the RSS1 feed, suitably modified and perhaps generated automatically in this way, also using XML’s native types.
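The mapping Danny describes, from a typed custom-attribute element to an RDF statement, can be sketched in a few lines. Everything here is an assumption made for illustration: the `element_to_ntriple` helper, the `TYPE_MAP` table from Base type names to XSD datatypes, and the choice of N-Triples as output. It goes in the opposite direction to the template tweak he suggests, but the correspondence is the same.

```python
import xml.etree.ElementTree as ET

CNS = "http://base.google.com/cns/1.0"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Assumed mapping from Google Base type names to XSD datatypes.
TYPE_MAP = {"int": "integer", "string": "string", "float": "float"}


def element_to_ntriple(subject_uri, element):
    """Turn one custom-attribute element into an N-Triples line.

    Assumes the value contains no quotes or backslashes (no escaping is done).
    """
    # ElementTree tags look like "{namespace}local".
    namespace, local = element.tag[1:].split("}", 1)
    xsd_type = TYPE_MAP[element.get("type", "string")]
    # As in RDF/XML, the property URI is namespace + local name, concatenated.
    return f'<{subject_uri}> <{namespace}{local}> "{element.text}"^^<{XSD}{xsd_type}> .'


snippet = (
    f'<c:prior_experience_years xmlns:c="{CNS}" type="int">'
    "5</c:prior_experience_years>"
)
triple = element_to_ntriple("http://example.org/itemURI", ET.fromstring(snippet))
print(triple)
```

Note the property URI that comes out: since the namespace URI does not end in `/` or `#`, plain concatenation glues `1.0` straight onto the attribute name, one more sign that the namespace was not designed with RDF in mind.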

It’s a point that can be worked on…

Conclusions

Pulling the threads together is no easy task, given the amount of material and the possibilities offered by this new service…

Still, a few things can be said about it…

Google seems to have taken its first timid step toward the Semantic Web, adding an XML metadata layer to the Web in its usual way of operating, that is, in grand style and in beta
The new system is in practice a new way of advertising one’s content on the Web, and it gives visibility to those who use it by drawing on and integrating with the latest technologies, such as RSS and XML; a necessary technological and cultural step toward the world of metadata
As for the extensions proposed for RSS, one can now imagine new uses for advertising one’s product price lists and much more, creating ad hoc RSS lists that are easy to load into Google Base: a first taste of what will actually be possible with the Semantic Web, where everyone will speak RDF
On the one hand we could have expected more from a giant like Google: the world of RDF is still lagging, and a step forward from the big G would have given it a real push. But at least it introduces the concept of metadata to the general public, which will make later work easier, and that is no small thing…
With tools like XML Army Knife, we may well be able at a later stage to exploit Google Base’s data to build RDF layers and much more on top of it: if an API comes out too, it will then be a very short step…
Worth keeping an eye on is Shelley Powers’s analysis of the declining importance of decentralized nodes for Google’s business and of its new way of concentrating resources on itself

These are big topics, but at least a first pass has been made…

If anyone wants to add something, you are welcome as always…

We’ll come back to this, stay tuned :)

Matteo Brunati