Technical workshop on Cross-searching digital collections

On the 21st September, Visualising China held a very useful workshop in Bristol, looking at the problems of cross-searching multiple remote collections by providing users with a search over integrated data, harvested periodically from target sites/services.

The workshop aimed to enable shared understanding and expertise, as opposed to attempting to define a wholesale solution to the myriad problems surrounding cross-searching. Problems range from how to rank search results to how to sustain a cross-search service when searchable endpoints inevitably change their delivery mechanisms over time.

Presentations were given by SOAS, the Connected Histories project, Visualising China and EDINA’s Aggregations Scoping Study.

Summary of Presentations

Visualising China – described how in all cases we harvest and store metadata, as opposed to storing binary image files or texts. We use a Linked Data/Semantic Web solution to harvest, store and index:
1. We use a Java client library for the Google Books Libraries API to build RDF from that data source,
2. We query, via SQL, the image collection metadata on the server in France (which hosts our Historical Photographs of China), and transform the data to RDF for the backend, and
3. We similarly use XSLT to transform the Belfast XML metadata (Sir Robert Hart image collection) to RDF.
We use a homegrown solution to cross-query multiple SPARQL endpoints: Arnos, software we have developed for this and other projects. Further issues discussed here included how to auto-generate/capture geo-locations, timelines and controlled vocabularies across disparate content; people/resource identity matching across interrelated collections; index coverage to support the appropriate level of user searching; searching user-generated content; and the scalability of the Arnos solution.
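To make the "XML metadata to RDF" step a little more concrete, here is a minimal sketch in Python. It is not the project's code: it uses Python's ElementTree rather than XSLT, and the sample record layout, the field mapping and the example.org base URI are all invented for illustration. Only the Dublin Core property URIs are real.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical source records; element names and ids are invented for this sketch.
SAMPLE_XML = """
<records>
  <record id="img-001">
    <title>Custom House, Shanghai</title>
    <creator>Unknown photographer</creator>
    <date>1905</date>
  </record>
  <record id="img-002">
    <title>Harbour view</title>
    <date>1910</date>
  </record>
</records>
"""

# Assumed mapping from source element names to Dublin Core properties.
FIELD_TO_PROPERTY = {
    "title": DC + "title",
    "creator": DC + "creator",
    "date": DC + "date",
}


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def records_to_ntriples(xml_text, base_uri):
    """Return one N-Triples line per mapped, non-empty field of each record."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for record in root.iter("record"):
        subject = f"<{base_uri}{record.get('id')}>"
        for child in record:
            prop = FIELD_TO_PROPERTY.get(child.tag)
            value = (child.text or "").strip()
            if prop and value:
                lines.append(f'{subject} <{prop}> "{escape_literal(value)}" .')
    return lines


triples = records_to_ntriples(SAMPLE_XML, "http://example.org/image/")
for line in triples:
    print(line)
```

Each output line is a subject, property and literal that a triple store could load. Unmapped elements are simply skipped, which is one place where coverage decisions of the kind discussed above would have to be made.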

Connected Histories – described how they periodically harvest content collections, primarily the Old Bailey Proceedings Online, and create search indexes on them using Lucene. They are also applying automated entity recognition via NLP to some of the unstructured OCR-derived data. Their federated search facility has grown from 9 datasets to 14. A key issue here, then, is scalability, and they doubt that using RDF/Semantic Web will give them a scalable solution. Other issues include negotiating lengthy licence agreements with commercial providers, and the time-consuming process of ‘understanding’ the nature of the metadata/content from each different provider.
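As a rough illustration of searching one index built from several harvested datasets, here is a small Python sketch. It is a plain in-memory inverted index with term-frequency scoring, not Lucene and not the Connected Histories implementation; the class name, dataset names and sample texts are all made up.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class CrossIndex:
    """One inverted index over documents harvested from several datasets."""

    def __init__(self):
        # term -> Counter mapping (dataset, doc_id) to term frequency
        self.postings = defaultdict(Counter)

    def add(self, dataset, doc_id, text):
        for term in tokenize(text):
            self.postings[term][(dataset, doc_id)] += 1

    def search(self, query):
        """Return (dataset, doc_id) keys containing every query term,
        ordered by summed term frequency (highest first)."""
        terms = tokenize(query)
        if not terms:
            return []
        scores = Counter()
        matched = None
        for term in terms:
            docs = self.postings.get(term, {})
            matched = set(docs) if matched is None else matched & set(docs)
            for key, frequency in docs.items():
                scores[key] += frequency
        return sorted(matched, key=lambda key: (-scores[key], key))


# Hypothetical harvested content from two datasets.
index = CrossIndex()
index.add("trials", "t1", "Theft of a silver watch from a shop")
index.add("trials", "t2", "Assault in the street")
index.add("newspapers", "n1", "Silver watch stolen; watch recovered later")

print(index.search("silver watch"))
```

Because every document lands in the same index, results from different datasets are ranked against each other directly. The ranking here is deliberately naive, which hints at why ranking across collections came up as a problem in its own right.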

SOAS – Theirs is a joint project with Yale University, connecting Arabic resources across the two parties. Mapping metadata schemes to provide an integrated search has proved difficult, as the two parties have not interpreted common standards (such as Dublin Core) in exactly the same way. Malcolm Raggett talked in depth about the importance of providing solutions that are fully appropriate to the end user, and about the technical implications of doing so.
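The mapping problem can be pictured with a small Python sketch of a per-source "crosswalk". The two partner profiles and the specific differences (name order, free-text dates) are invented examples of how the same Dublin Core fields might be filled in differently; they are not taken from the SOAS/Yale project.

```python
import re


def normalise_year(value):
    """Pull the first four-digit year out of a free-text date, or return None."""
    match = re.search(r"\b(\d{4})\b", value)
    return int(match.group(1)) if match else None


def invert_name(value):
    """Turn 'Surname, Forename' into 'Forename Surname'; leave other forms alone."""
    if "," in value:
        surname, forename = [part.strip() for part in value.split(",", 1)]
        return f"{forename} {surname}"
    return value.strip()


# Hypothetical per-source rules: each partner fills the same fields differently.
PROFILES = {
    "partner_a": {"creator": invert_name, "date": normalise_year},
    "partner_b": {"creator": str.strip, "date": normalise_year},
}


def to_common(source, record):
    """Apply the source's rules field by field; unlisted fields are only trimmed."""
    rules = PROFILES[source]
    return {field: rules.get(field, str.strip)(value) for field, value in record.items()}


record_a = to_common(
    "partner_a", {"title": " A manuscript ", "creator": "Doe, Jane", "date": "c. 1905"}
)
record_b = to_common("partner_b", {"creator": "Jane Doe", "date": "1905-03-12"})
print(record_a)
print(record_b)
```

Once both records are in the common shape, the creator and date values can be compared or indexed together. The hard part in practice is discovering what each profile should contain, which is the ‘interpretation’ work described above.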

EDINA scoping study – The Research Discovery Taskforce vision is to have a “collaborative, aggregated and integrated research discovery and delivery framework”. This study looked at one sub-area of that vision: Aggregations of Metadata about Images and Time based Media. The study will release its final report in the coming weeks, and Sheila Fraser gave an indication of what it will reveal. She discussed five different aggregator models they came across in the study (through surveys and targeted interviews). Issues and findings discussed here included: aggregations tailored to subject disciplines may well be the best approach; complex multimedia metadata is still needed (in the absence of improvements in image recognition, for example); common metadata schemas do not prevail, so a hybrid aggregator model, combined with a Linked Data approach, could be the most workable solution; user tagging usefully enriches metadata; aggregations should be made available to other aggregations; and user-centred design and focus is essential in this area. Sheila also demoed the VSM Portal Demonstrator (http://edina.ac.uk/projects/vsmportal/).

Feedback from participants after the workshop was very positive. Participants are all keen to stay in touch to share reports and experiences. Many thanks to all who came!
