Digital Odyssey 2009

Digitization: Theory and Practice

Google Book Settlement

Posted by dodyssey2009 on June 12, 2009

The Google Book Settlement & its Impact on Libraries

Speakers: Sian Meikle, Digital Services Librarian, ITS, University of Toronto Libraries
Tony Horava, Collection and Information Resources Coordinator, University of Ottawa

Blogged by Jana Purmalis

This presentation, based on a recent paper prepared for the OCUL Directors, provided an overview of the terms of the controversial Google Book Settlement, as well as an exploration of some of its implications.

Sian Meikle opened the session with some background on Google Book Search.  Established in the US in 2004, Google Book Search includes an estimated 7 million books scanned through partnerships with 20 libraries and several publishers.  The database includes material that is both in copyright and out of copyright, as well as in print and out of print.

While Google claimed that indexing the full text of in-copyright materials for search and discovery only constituted fair use, the Association of American Publishers and the Authors Guild felt differently, and in 2005 two US class-action lawsuits were launched against Google (later merged into one).  In October 2008, a settlement was announced, pending approval by the US court.  The Fairness Hearing is scheduled for October 2009. 

As noted by Ms. Meikle, the terms of the settlement are vast and complex, but here are some of the main points:

  • The lawsuit only concerns rights within the US
  • The settlement only covers books (not journals, newspapers, or other formats)
    • It includes all in-copyright books covered by the Berne Convention (anything published in one of the 164 signatory countries)
    • But only those books that existed prior to January 2009
  • It is a non-exclusive settlement, in that rights holders retain the rights to make agreements with other providers
  • Google would have to pay out $60 per title to rights holders for books scanned prior to May 2009 ($45 million total)
  • Google would have the right to sell online access to books and to create other related products
    • 63% of revenue stream would go to copyright holders and 37% to Google

Tony Horava, Collection and Information Resources Coordinator at the University of Ottawa, delved further into some of the major issues and questions raised by the terms of the settlement:

  • Even though the agreement is non-exclusive, Google will be a massive “one-stop shop” for online books, one that will very likely deter competition from other ventures
    • Will the settlement grant Google a de facto monopoly? What about Google’s influence on the market pricing of out-of-print books? What will happen to the smaller players?
    • How will the ebook aggregators (for example, NetLibrary and ebrary) be affected? Google has deep pockets for R&D. 
    • How will Amazon be affected? Google wants to build an open platform for selling e-books online that would be device and platform-independent
    • If and when it is available in Canada, what will the pressure be like for libraries to acquire it?
  • Although revenue generated from out-of-print titles would theoretically be split 63-37% between rights holders and Google, respectively, orphan works, whose rights holders cannot be located, comprise 79% of the out-of-print material
    • Under the settlement, only Google gets rights to provide and sell these orphan titles online
    • More revenue for Google
  • Privacy issues
    • The service will require user identification, and book products are to be housed on very few servers
    • To what extent will Google be able to monitor and track user reading habits?
  • Intellectual freedom and transparency
    • The Research Corpus is to be “carefully gate-kept” and contents of the Rights Registry are not publicly available
    • Using the Research Corpus for “non-consumptive” research requires that a research agenda first be approved by the host institution
  • Equity of access
  • Long-term security of data

To conclude, Mr. Horava emphasized the importance of careful monitoring and advocacy with stakeholders, encouraging audience members to contemplate the potential impact of the Google Book Settlement on the future of libraries. 

Posted in Session Blog Entries

Large Scale Digitization at Toronto Public Library

Posted by dodyssey2009 on June 12, 2009

The Luminary Library Experience: Large Scale Digitization at Toronto Public Library

Speaker: Johanna Wellheisser, Manager of Preservation and Digitization Services, Toronto Public Library

 Blogged by Jana Purmalis

Johanna Wellheisser presented a discussion on large-scale digitization at the Toronto Public Library (TPL), addressing the various stages of the process, as well as some of the challenges and possible future directions for the project.

Ms. Wellheisser started the session by discussing some of the ways in which the TPL has ventured into “uncharted waters” as part of its long-standing commitment to collecting and preserving materials, with large-scale digitization being one of the more recent developments.  Ms. Wellheisser was careful to distinguish the TPL’s “large-scale digitization” efforts, which involve the creation of discrete collections or document sets, from “mass digitization” projects such as the Google Book Search, which occur on a much grander scale, require minimal human intervention, and emphasize quantity over quality. 

In the past, digitization demanded that books be unbound or flattened, meaning that digital conversion was undertaken at the expense of the original item.  Furthermore, the process was extremely labour intensive and costly.  More recent technological developments, however, have improved the efficiency of the digitization process, enabling exciting new opportunities for presenting content.  For example, newer technology offers features such as open-book capture, robot page-turning, dual camera capture, and processing software.  Scanning standard items requires less human intervention and raises fewer preservation issues.  Ms. Wellheisser identified four key companies that provide digital conversion equipment:

  • Treventus
  • Qidenus
  • 4DigitalBooks
  • Kirtas (used at TPL)

With large-scale digitization becoming a more feasible endeavour, the Luminary Library Project was established in 2007 as a partnership among Kirtas, Ristech, Amazon, and various university and public libraries to create print-on-demand material for Amazon.  As Ms. Wellheisser explained, the TPL’s goal is to digitize and make available, both freely online and through print-on-demand, 10,000 volumes of copyright-free pre-Confederation Canadian imprints over the next five years.  The basic driver behind the project is access, or as Ms. Wellheisser put it, to allow “discovery anywhere, any time, any place.” As of 2008, 1,000 books had been digitized and delivered, and the TPL is currently on target to deliver 2,000 more by the end of 2009.  About 1,025 books are available for purchase online in Amazon.com’s Historical Reproductions Store.  

There are several stages in the digitization process:

1) Planning – a key stage at the outset of the project.  Several sets of guidelines must be established, including those for collection selection, tracking items, image quality standards, pricing models, and project timelines.

2) Selecting materials and choosing an order of production that maximizes efficiency.  For example, if books of similar size, height, and weight are scanned in succession, the setup of the digitization equipment requires fewer adjustments.

3) Scanning books using Kirtas APT BookScan 2400 and APT Manager software (may require troubleshooting).

4) Image editing:

  • Superbatch processing of raw JPEG images using BookScan Editor software
  • Resulting bitonal TIFFs are edited further to make either global or individual page adjustments; this can involve cleaning up margins, straightening text, and so on

5) Product delivery. Bitonal TIFFs are converted to PDFs and delivered weekly to Kirtas via FTP, accompanied by metadata as ONIX XML files.
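
As a rough illustration (not part of the presentation), the delivery step might look something like the sketch below: page TIFFs are bundled into a PDF and pushed, along with an ONIX sidecar file, to the vendor’s FTP drop. The library choice (Pillow), file names, and server details are assumptions, not the TPL’s actual workflow.

```python
# A hedged sketch of the delivery step described above: bitonal TIFFs are
# bundled into a PDF and uploaded, with an ONIX XML sidecar, via FTP.
# Library choices (Pillow), paths, and server details are illustrative only.
from ftplib import FTP
from pathlib import Path

from PIL import Image  # pip install Pillow


def tiffs_to_pdf(tiff_dir: Path, pdf_path: Path) -> None:
    """Combine page TIFFs (sorted by name) into one multi-page PDF."""
    pages = [Image.open(p) for p in sorted(tiff_dir.glob("*.tif"))]
    first, rest = pages[0], pages[1:]
    first.save(pdf_path, save_all=True, append_images=rest)


def deliver(pdf_path: Path, onix_path: Path, host: str, user: str, password: str) -> None:
    """Upload the PDF and its ONIX metadata file to the vendor's FTP drop."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        for path in (pdf_path, onix_path):
            with open(path, "rb") as fh:
                ftp.storbinary(f"STOR {path.name}", fh)


if __name__ == "__main__":
    book = Path("book_0001")  # hypothetical book folder
    tiffs_to_pdf(book / "tiffs", book / "book_0001.pdf")
    deliver(book / "book_0001.pdf", book / "book_0001_onix.xml",
            host="ftp.example.com", user="tpl", password="secret")
```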

But the process is not without its challenges.  Ms. Wellheisser identified several obstacles faced by the TPL’s digitization crew:

  • Significant learning curve associated with new hardware and software
  • Underestimating time required to complete project
  • Historical documents are often problematic for scanning because of their fragility, size, faded text, contrast, bleedthrough, and foldouts
  • Nature of materials demands that most scanning is manual rather than automatic, therefore sacrificing higher levels of productivity
  • Editing images to meet quality control standards can be extremely time-consuming and curb productivity; common trade-off between image quality and production levels
  • Library/vendor relationship is still in early stages of development

Ms. Wellheisser closed the session by looking to the future.  She expressed interest in seeing the existing set of Kirtas users (which includes the TPL, McGill University, and McMaster University) transition toward a more formal user group.  She also mentioned talks with Amazon to improve its current user interface, as well as plans to run OCR in the long term and to explore the possibility of Digitization-on-Demand.  Going forward, the TPL will continue to look for different ways of thinking about its collections and making them accessible in new ways.  

Posted in Session Blog Entries

Planning and Managing a Digitization Project

Posted by dodyssey2009 on June 11, 2009

Blogging by: Robert Keshen

Speaker: Loren Fantin, Project Manager, Our Ontario.

In planning and managing a digitization project, we need to ask the five Ws: Why digitize? When (what are the timelines)? What are your goals? Who is the audience? And how will you do it?

The How is the question most often asked, but it can be the easiest to answer. At the beginning, it is important to think about how the data will be created, how it will be managed, how the digital content can be made discoverable, and what search functionality will be needed. It is not enough to just have a project; it must be part of a larger program. This program should have a mission and objectives and be aligned with organizational goals. Funding must also be taken into account: digitization should be part of the overall mandate, with funding set aside for it.

When starting a digitization project, partnerships should also be sought out. These partnerships can yield shared resources and regional representation. An example of a successful collaboration is Picture St. Marys, where the library partnered with the museum to digitize their collection for the betterment of the community.

Project fundamentals include project planning, staffing, workplans and workflow, risk management, measurement and evaluation, and promotion. It is important to note that planning will last throughout the entire project. Plan, plan, plan, Loren stresses, and then plan some more. For staffing, the important issues to address include whether staff will be internal or external, the hiring process, skill sets and responsibilities, and their impact. It is important to pass knowledge on internally, so that if someone leaves the project, their knowledge does not leave with them.

The budget should be determined based on the tasks and the time required per task. The workplan and workflow are determined by responsibilities, timelines, best practices, policies, and the digitization process itself. Other planning requirements include risk management, scope creep (keep to the limitations set out in the workplan), and measurement and evaluation.

Promotion often slips people’s minds, but it is very important. There are two types of promotion strategies: push and pull. Push strategies include press releases, brochures, emails, presentations, and advertising. Pull strategies include collaboration, events, and user feedback. There is no need to wait until the project is over before promotion begins; make it part of the process so people are aware of what you are doing.

Digital Collections – created when digital objects are selected and organized for discovery, access and long-term use. This includes metadata, objects, and user interface. A good collection is created according to an explicit collection development policy and is sustainable over time.

Metadata – a digital project without metadata cannot be found, and it is essential that metadata be shareable and translatable to other organizations. Different types of metadata include descriptive, administrative, technical, and structural. The goal is consistency for your staff and your organization. This includes minimum requirements: a title, a physical description, a unique identifier, and a unique URL.
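
As a rough illustration of those minimum requirements, here is a small sketch (not from the session) of a record structure with a completeness check; the field names follow the list above, and everything else is an assumption.

```python
# Illustrative only: a minimal record reflecting the "minimum requirements"
# listed above, plus a simple completeness check. Field names follow the talk;
# the sample values are invented.
REQUIRED_FIELDS = ("title", "physical_description", "identifier", "url")

record = {
    "title": "Main Street, looking east",  # hypothetical record
    "physical_description": "1 photograph : b&w ; 13 x 18 cm",
    "identifier": "stm-0001",
    "url": "http://images.example.org/stm-0001",
}


def missing_fields(rec: dict) -> list[str]:
    """Return the required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not rec.get(f)]


if __name__ == "__main__":
    gaps = missing_fields(record)
    print("Record is complete" if not gaps else f"Missing: {', '.join(gaps)}")
```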

Issues – these include copyright, which should be determined for EVERY object. Ownership and permission must be established. Digital rights and terms of use must also be established to determine what the end user is allowed to do with the data. A recommended citation statement is also helpful. A decision must be made as to whether the digital conversion will be done in-house or outsourced. When implementing, a master copy and lower-quality derivative versions should be produced.
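
On the master-copy-plus-derivatives point, here is a hedged sketch (not from the session) of generating a JPEG access copy and a thumbnail from an archival master TIFF; the Pillow library, sizes, and quality settings are assumptions.

```python
# A sketch of the "master copy plus lesser-quality versions" idea above:
# derive a JPEG access copy and a thumbnail from an archival master TIFF.
# Pillow is assumed; sizes and quality settings are arbitrary examples.
from pathlib import Path

from PIL import Image  # pip install Pillow


def make_derivatives(master: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    with Image.open(master) as img:
        rgb = img.convert("RGB")

        # Access copy: full size, moderately compressed JPEG.
        rgb.save(out_dir / f"{master.stem}_access.jpg", quality=80)

        # Thumbnail: bounded to 200x200 px, preserving aspect ratio.
        thumb = rgb.copy()
        thumb.thumbnail((200, 200))
        thumb.save(out_dir / f"{master.stem}_thumb.jpg", quality=80)


if __name__ == "__main__":
    make_derivatives(Path("masters/stm-0001.tif"), Path("derivatives"))
```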

The two biggest factors in success are the planner’s attitude and their ability to be creative. Loren gave the example of a project at Brock University that had no funding and so used 30-day free software trials to digitize canal maps.

Management – a naming convention should be used, with unique identifiers for each object and its data. Once the digital collection is made, user discovery, access, and interaction must be examined. Users are an integral part of the equation. Users begin in many different places, including iTunes, Facebook, Twitter, and so on. Where users are not is the institutional website, so the collection needs a presence on these outside services. Some tips include adding a link to the end of a Wikipedia article, including a photo page on Flickr with links, and creating a Facebook group. Search widgets on external webpages are also helpful. User feedback is important as well, as it allows the community to interact with and add to your collection.

Posted in Session Blog Entries

Send us your presentations…please!

Posted by digitalodyssey on June 11, 2009

Thanks to all presenters and speakers for having provided us with another successful Digital Odyssey day this year!

Please submit your slide presentations as attachments to insideolita@gmail.com and we’ll make them available on the blog as soon as possible.

Posted in General Information

The ContentPro Solution [Tom Adam]

Posted by dodyssey2009 on June 8, 2009

Speaker:
Tom Adam
Research & Development Librarian: Information Literacy
Western Libraries
University of Western Ontario

Blogger:
Brian Park

Adam spoke about the importance and usefulness of digitization for both academic and public libraries and described how ContentPro is helping his team at Western Libraries digitize their unique collection. He began by stating that the increasing demand for digital information has made it necessary for libraries to amass all forms of digital content, from photographs of artifacts to video footage of interviews. He also noted that library patrons are seeking more intuitive, hassle-free catalogues that resemble the openness and accessibility of Internet search engines. For Adam and his colleagues, the solution to the challenges of digitization was to integrate ContentPro as an extension of their online databases. Some of the basic features of ContentPro include the following:

  • ContentPro can work with images, sound files, movies, PDF text, etc.
  • simple interface and web-based submission allow easy file upload and metadata description
  • uses thumbnails of images for visual browsing
  • provides a number of security and administrative features including collection branding, content censorship, limiting viewing size, etc.
  • stores digital files both locally and externally
  • batch metadata and content import
  • dedicated ContentPro server
  • enhances access through “Encore” discovery layer (branded in Western Libraries as “Quick Search”) – a search function that allows users to search all collections at once
  • users can also choose to focus on a single collection by searching with keywords

In addition to simplifying the process of digitization, the use of ContentPro has given Western Libraries the opportunity to form partnerships with eight academic and public libraries across the world. Adam mentioned that a huge international community of libraries has been created on the basis of weekly online meetings and a feedback system to continually improve the usability of ContentPro.

At the same time, ContentPro also played a significant role in bridging the gap between Western Libraries and other resource branches within the University of Western Ontario. Adam announced that a partnership between Western Libraries and the McIntosh Gallery was successful in fully digitizing Delaquerrière’s photo album, a rare collection from the 19th century, by using ContentPro.

Near the end of his presentation, Adam identified a number of problems with the current build of ContentPro, including copyright considerations, program bugs, quality control of uploads, and the standardization of databases. However, despite these issues, Adam concluded by stating that ContentPro holds limitless potential as a research tool due to its remarkable accessibility and patron-friendly features. To substantiate his support for ContentPro, Adam proudly showcased Felix Blangini’s “La Fée Urgèle,” a full opera manuscript that had been digitized and made publicly available in its original form on ContentPro three weeks earlier.

Posted in Session Blog Entries

OCR Options for Scanned Content [Art Rhyno]

Posted by dodyssey2009 on June 5, 2009

OLITA Digital Odyssey 2009 session by Art Rhyno, Systems Librarian at the University of Windsor, and a member of the Our Ontario Technical Committee

————————————————————–
Art identified 3 big factors in successful OCR:

  • the importance of fonts
  • languages (and their character sets)
  • the use of dictionaries

OCR software vendors claim accuracy rates of over 90-95%… but this is just not true in the real OCR world, especially with newspapers as the starting content, where names and terms are simply not in the dictionary!

OCR is most common, but don’t forget ICR (intelligent character recognition), a recognition engine for handwritten characters.

Major OCR engines include ABBYY, Nuance OmniPage, Readiris, Océ, CharacTell, Verus, Parascript…  Trial software lets you test an engine against YOUR content; consider downloading and trying several to see which gives the best results. Common commercial products include:

  • ABBYY FineReader
  • Adobe Capture
  • Readiris

OCR engines are now coming along in the open source world as well… Google uses the tesseract-ocr engine (but tweaks it heavily).  If you want to see more about the open source Tesseract and OCRopus OCR engines, you can visit http://sites.google.com/site/ocropus/ocr-resources for the full list of resources, supports, and so on.
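
For readers who want to experiment, here is a minimal sketch (not from the session) of driving the open source Tesseract engine from Python. The command-line form `tesseract <image> <outputbase>` writes `<outputbase>.txt`; the file names are purely illustrative, and OCRopus has its own tooling not shown here.

```python
# A minimal sketch of driving the open-source Tesseract engine against a
# scanned page image. "tesseract <image> <outputbase>" writes <outputbase>.txt.
# File names are illustrative; OCRopus has its own command-line tooling.
import subprocess
from pathlib import Path


def ocr_page(image: Path, outputbase: Path) -> str:
    """Run tesseract on one page image and return the recognized text."""
    subprocess.run(["tesseract", str(image), str(outputbase)], check=True)
    return Path(f"{outputbase}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    text = ocr_page(Path("efp_1925-03-06_p01.tif"), Path("efp_1925-03-06_p01"))
    print(text[:500])
```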

As the publisher of the Essex Free Press, a smaller community newspaper with archives going back many years, Art focused his OCR comments and examples on heritage newspapers. The “problem” with newspaper content is the sheer volume of the material: reels and reels of microfilm are the usual source files (or, even worse, microfiche!). Many small community newspapers treated the move to microfilm as a way of throwing away the backlog of print archives, so a microfilm or fiche may be the only source available. And the volume of content is HUGE: an average small weekly generates a 16-page paper, or roughly 800 pages per year, all of it dense text.

Scanned and stored on microfilm reels, the content takes about 27 hours of processing per reel with current OCR engines, so there is a long wait while the scanned content is processed (whether using the server or desktop-based version).

One of the more interesting costs of OCR licensing is that the density of content and the size of a newspaper page eat up an actual 3-4 pages of license! Even so, costs are quite reasonable. Samples of ABBYY OCR’ing and the actual microfilm were shared, and the common kinds of errors were noted.

Art noted that the output file is really big! With ABBYY, 255 characters of output get generated PER character of scanned content, but this data gives rich output such as positional information and character confidence.  (Yup, that was a 255:1 ratio!)

Why do you want this incredible level of information?

  • Hit highlighting can be enabled for the end-user when keyword searching
  • Questionable characters  and words can be highlighted for OCR correction and editing
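
To make those two points concrete, here is a hedged sketch of flagging questionable characters from that kind of rich output. It assumes an ABBYY-style XML export in which each character element carries coordinates and a confidence attribute; the element and attribute names below are illustrative placeholders, not a verified schema.

```python
# A sketch of flagging low-confidence characters for correction, assuming an
# ABBYY-style XML export where each character element carries position and a
# confidence attribute. The element/attribute names are illustrative
# placeholders, not a verified ABBYY schema.
import xml.etree.ElementTree as ET

CONFIDENCE_THRESHOLD = 60  # assumed 0-100 scale


def questionable_chars(xml_path: str):
    """Yield (character, confidence, left, top) for low-confidence characters."""
    tree = ET.parse(xml_path)
    for char in tree.iter("charParams"):                 # assumed element name
        conf = int(char.get("charConfidence", "100"))    # assumed attribute
        if conf < CONFIDENCE_THRESHOLD:
            yield char.text, conf, char.get("l"), char.get("t")


if __name__ == "__main__":
    for ch, conf, left, top in questionable_chars("efp_1925-03-06_p01.xml"):
        print(f"'{ch}' confidence={conf} at x={left}, y={top}")
```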

There is an evolving open source OCR space, as previously noted; OCRopus uses the Tesseract engine. This works just fine for scanning from paper sources.  Art compared a dense page of content OCR’d by both ABBYY and OCRopus: the output was highly comparable, with ABBYY getting a few more words correct and OCRopus doing slightly better with numbers and equations. However, OCRopus produces hOCR, which is not as granular as ABBYY’s underlying XML. Art prefers having the option of getting into that XML and tweaking it.

Art spoke of the current use of approximately 260 workstations at the Leddy Library, U of Windsor: in the off-hours the ABBYY processing is spread across clustered workstations, which vastly improves the processing timelines! There is potential within the university community to set up clustering services (maybe in a SETI-like share?) to help with the load. Art noted that it was a bit tricky to set up: machines need to be on the same subnet, must have 90-100% of their CPU available (so they cannot be doing anything else), and do not do any duplicate checking. He noted that OCRopus can also be set up this way using the Cygwin base tools on each workstation.

One of the key things noted is that as the processing gets done, it’s a really good habit to add proper semantics to the file names: provide issue, page, and date info as a semantic naming convention.
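
Here is a tiny sketch of what such a semantic naming convention might look like; the exact pattern is an assumption, not the one Art uses for the Essex Free Press.

```python
# A tiny sketch of a semantic naming convention carrying title code, issue
# date, and page number, as suggested above. The pattern is an assumption.
from datetime import date


def page_filename(paper_code: str, issue_date: date, page: int, ext: str = "tif") -> str:
    """e.g. page_filename('efp', date(1925, 3, 6), 1) -> 'efp_1925-03-06_p01.tif'"""
    return f"{paper_code}_{issue_date.isoformat()}_p{page:02d}.{ext}"


print(page_filename("efp", date(1925, 3, 6), 1))
```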

Another item of note was the perfect marriage of Lucene and Solr for indexing the OCR’d content, creating searchable content matched up to the digital scanned page, with hit highlighting.
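
As an illustration of that Lucene/Solr pairing, here is a hedged sketch of a highlighted search against an assumed index; the Solr URL and the `ocr_text` field name are assumptions about the setup, while `q`, `hl`, `hl.fl`, and `wt=json` are standard Solr request parameters.

```python
# A hedged sketch of searching OCR'd pages in Solr with hit highlighting.
# The Solr URL and the "ocr_text" field name are assumptions about the index;
# q, hl, hl.fl and wt=json are standard Solr query parameters.
import json
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/select"  # assumed core URL


def search_pages(query: str):
    params = urllib.parse.urlencode({
        "q": f"ocr_text:{query}",
        "hl": "true",
        "hl.fl": "ocr_text",
        "wt": "json",
    })
    with urllib.request.urlopen(f"{SOLR_SELECT}?{params}") as resp:
        return json.load(resp)


if __name__ == "__main__":
    results = search_pages("harbour")
    # The "highlighting" section maps each document id to its snippets.
    for doc_id, snippet in results["highlighting"].items():
        print(doc_id, snippet.get("ocr_text", []))
```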

As a wrap-up:

  • the commercial options are certainly there; open source is catching up fast
  • it can be done in house (it will EAT CPU cycles), or it can be shipped out… keep it in house if you can.
  • the more pristine the microfilm is, the better the output will be; few small papers will actually have this… some universities may have originals available, and the master negative is best.

———————————————-

blogged by windsordi

Posted in Session Blog Entries

Blogs and links

Posted by dodyssey2009 on June 3, 2009

Coverage of this event coming soon!

Posted in Session Blog Entries

The Perfectability of Data

Posted by dodyssey2009 on June 3, 2009

Blogging by: Robert Keshen

Speaker: Walter Lewis, Information Architect, Our Ontario. Walter also co-authored “River Palace”, a book about a steamboat built in 1985.

There are approx. 17,000 records in Our Ontario, and their metadata has a story. Walter is not speaking about perfect data; this is an odyssey of data, the progression towards perfectability.

Is what we are doing good enough? Is this just the minimum that the system will accept, or is it the best our current system will allow? At Our Ontario, they are looking for the best data from other sites to best direct people to their websites. Sharing metadata creates traffic, and therefore there is a desire for the best data possible.

Walter wants to discover how we can re-examine data to better suit our needs. He began by looking at the RSS feed, using pictures to show how RSS feeds can be visually represented to make the data more interesting. The same information can also be accessed in Google Earth, with the additional information of latitude and longitude. Thus, we end up with layers of information on data from different sources. RSS provides channel information, title, URL, date of record creation, and description. Media RSS provides the other RSS elements, along with an enclosure including file name (URL), file size, and MIME type. For Google Earth, a KML file was created, which includes a feed element, title, URL, lat/long, and a description with embedded HTML. An important point here is that the data was separate but can be combined for our own purposes.
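
As a rough sketch of that KML piece (not shown in the session), the same record fields exposed in the feed can be poured into a KML file along the lines below; the sample record values are invented for illustration.

```python
# A small sketch of building the kind of KML file described above from record
# metadata (title, URL, latitude/longitude, description). The sample record
# values are invented for illustration.
from xml.sax.saxutils import escape

KML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
{placemarks}
  </Document>
</kml>"""

PLACEMARK_TEMPLATE = """    <Placemark>
      <name>{name}</name>
      <description><![CDATA[<a href="{url}">{title}</a><br/>{description}]]></description>
      <Point><coordinates>{lon},{lat}</coordinates></Point>
    </Placemark>"""


def records_to_kml(records) -> str:
    placemarks = "\n".join(
        PLACEMARK_TEMPLATE.format(
            name=escape(r["title"]), title=r["title"], url=r["url"],
            description=r["description"], lat=r["lat"], lon=r["lon"],
        )
        for r in records
    )
    return KML_TEMPLATE.format(placemarks=placemarks)


if __name__ == "__main__":
    sample = [{
        "title": "Steamer at the town wharf",  # invented example record
        "url": "http://images.example.org/rec/123",
        "description": "Photograph, ca. 1895.",
        "lat": 43.65, "lon": -79.92,
    }]
    print(records_to_kml(sample))
```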

Granularity of metadata – the first issue that should be dealt with is the “name”, which is highly structured due to AACR/RDA. If you can separate names and know what the names mean in context, the data becomes much more meaningful. The bigger issue is for the data set to have internal consistency; punctuation should not be used to separate data elements. External consistency also matters: you want your system to store external identifiers like LCSH, dates, and place names. The goal is to allow others to visualize our data and make good use of it.

Where do we start? Not with Dublin Core. It is all about aggregating material, and for a local system it is about output. Dublin Core by itself is only good for generally bringing things together, not for extracting particular data. What we need is internal consistency. Walter gave the example of his own work with an article from the Georgetown Independent from May 12th, 1895. The paper had many different names and spellings of names, different places, and different forms of dates, including an incorrect year (which should be 1985). For dates, Dublin Core is not that helpful: it can result in ambiguous dates, partial dates, uncertain dates, and date ranges. These kinds of things have to be dealt with in our systems; Walter handles them by tagging records with the year and starting month for consistency. So how do we deal with dates more effectively? It is important to sort the different fields and get organized. The same technique can be used for places, both geographical and spatial.
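
Here is a small sketch, not Walter’s code, of normalizing those messy date forms (full dates, month and year, year only) into sortable keys along the lines he described; the parsing rules are deliberately simplistic and for illustration only.

```python
# A small sketch of normalizing messy source dates into sortable keys, in the
# spirit of tagging records by year and starting month as described above.
# The parsing rules are deliberately simplistic and for illustration only.
import re

MONTHS = {m: i for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def sort_key(raw: str) -> str:
    """'May 12th, 1895' -> '1895-05-12'; 'May 1895' -> '1895-05'; '1895' -> '1895'."""
    text = raw.lower()
    year = re.search(r"\b(1[6-9]\d\d|20\d\d)\b", text)
    month = next((n for name, n in MONTHS.items() if name in text), None)
    day = re.search(r"\b(\d{1,2})(st|nd|rd|th)?\b,", text)  # crude day match
    if not year:
        return ""  # undated records sort separately
    key = year.group(1)
    if month:
        key += f"-{month:02d}"
        if day:
            key += f"-{int(day.group(1)):02d}"
    return key


print(sort_key("May 12th, 1895"), sort_key("May 1895"), sort_key("1895"))
```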

The important message is that any data trumps no data. Having records, even if they are confusing, is better than not having them. Once the time is spent cleaning up the data, sorting it and working through it, the payoff is huge. To do this, we need a title, persistent ID, a URL, name and code for the contributor, and many other pieces of data that can be mapped and sorted. This will result in meaningful metadata which, in turn, results in meaningful data.

Posted in Session Blog Entries

Are you attending Digital Odyssey? Send us your blog.

Posted by digitalodyssey on June 2, 2009

Are you going to Digital Odyssey?

Send us your blog, Twitter identity (in addition to proposed hash tags, e.g. #do2009, #digitalodyssey, etc.) and any related media coverage (Flickr? YouTube?) so we can link to it and send additional traffic your way.

We’ll post your submitted links on the InsideOLITA blog AND digitalodyssey blog.

We look forward to reading and viewing your coverage of this event as you record it!

Email us: insideolita@gmail.com

Posted in General Information

 