Some people eat, sleep and chew gum, I do genealogy and write...

Wednesday, October 14, 2015

How many genealogically important records are indexed?

Among genealogists, indexing historical records is a hot topic. Historically, genealogists had to search records one page at a time, unless there was an index created by the originator of the record. For example, when searching New England Town Records, I sometimes find that the Town Clerks created indexes for their own benefit in finding individual entries. Additionally, some books have indexes (indices).

But why indexes? What are they anyway? If everything is now on computers, can't we just look it up on Google or whatever?

OK, this subject goes way back. We need to start with the idea of organizing information. Originally, this meant mainly books. Once libraries grew large enough that the people familiar with the books could no longer find everything, a need arose to catalog the items in the library. Early cataloging systems relied on topical organization. The first acknowledged modern system of classification is attributed to Jacques Charles Brunet for the Paris Booksellers in 1842. See Wikipedia: Library Classification.

Classification systems became more sophisticated as the number of books and other items in the libraries grew. In my early library experience, I became familiar with the Dewey Decimal Classification, first published in 1876. The Family History Library in Salt Lake City, Utah, and the Harold B. Lee Library at Brigham Young University, where I now serve, both use a modified Dewey Decimal Classification system. I also became very familiar with the Library of Congress Classification System through my work as a bibliographer at the University of Utah's J. Willard Marriott Library.

Presently, many libraries use the services of OCLC. Here is a description of the service from their website:
OCLC is a global library cooperative that provides shared technology services, original research and community programs for its membership and the library community at large. We are librarians, technologists, researchers, pioneers, leaders and learners. With thousands of library members in more than 100 countries, we come together as OCLC to make information more accessible and more useful.
The most visible product of OCLC is WorldCat.org.

But the classification of books does not give us access to the information contained in those books. In addition, genealogists are mostly involved in searching individual documents, and searching such documents often involves reading the entire document. The idea of having a system other than the historic manual indexing and classification systems had to wait until the computer revolution and the subsequent distribution system known popularly as the Internet or Web.

Presently, the task of indexing involves a multi-step process. The text in a document or book has to be converted to a digital format. This involves using a device, such as a scanner or digital camera, to make an electronic image of the document (or page in a book) by converting the image into a series of electronic signals. In many cases, the patterns of letters within the images can then be further "read" or converted to an electronic text file through a process called optical character recognition (OCR). The text version of the book or document can then be searched by programs developed to look for certain combinations of letters, numbers and, in some cases, symbols. There are presently millions upon millions of digitized books online.
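
As a rough sketch of that last step: once OCR has produced plain text for each page, a search program only has to scan that text for a string of characters. The page text, the names and the `find_pages` function below are all invented for illustration and do not come from any real record or product.

```python
import re

# Hypothetical OCR output: page number -> recognized text (invented sample data).
ocr_pages = {
    1: "Births recorded in the town. Samuel Carter, son of Josiah Carter.",
    2: "Marriages. Thomas Carter and Abigail Hale were married.",
    3: "Deaths. No entries for this year.",
}

def find_pages(pages, query):
    """Return the sorted page numbers whose text contains the query, ignoring case."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return sorted(num for num, text in pages.items() if pattern.search(text))

print(find_pages(ocr_pages, "carter"))
```

Because every character of the page has been converted to text, any name on any page can be found this way, provided the OCR read it correctly.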

Where this system breaks down is with handwritten (script) documents. Programmed recognition of handwriting is still in its infancy. For genealogists, this is a major obstacle. In nearly all cases, even if we have digital copies of the documents now available, we still have to read through them page by page searching for ancestral information. All the digitization efforts and classification systems of the world do not help us when we finally get to the documents we need. Granted, the digitized documents do, in many instances, become far more available than the paper originals, but we are still involved in the time-consuming task of finding our ancestors in the documents ourselves.

Indexing, then, promises to solve this access issue. The idea is that people, volunteers or hired employees, will manually search through the documents and enter additional information about the contents, usually the names, dates and places mentioned. There is an underlying assumption that these indexes are beneficial to the genealogical researcher. There are, however, many limitations to indexing projects, some of which are not well recognized.

There is a difference between a classification or cataloging system and an index. The catalog identifies books and documents that have similar content. An index purports to tell the user where a certain item of information, such as a name, is located within the document or book. For those books and documents that have been completely digitized and converted to text, every word is searchable. The user can find a name or any other string of characters in the entire document. In addition, for cleanly printed text, the accuracy level of OCR systems is generally very high.

Indexing involves human intervention at different levels. First, the entity creating the index must select the terms to be indexed. An index is not a complete text search. Only selected items within the text are indexed and therefore available to the user. The indexing entity makes the determination of which items are significant. In addition, the people actually extracting the information from the documents are limited by the condition and readability of the originals and many other extraneous factors. Let's just say the process does not always work perfectly.
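
To make the distinction concrete, here is a minimal sketch of a selective index next to a full-text search of the same page. The transcription, the names and the choice of which names the indexers extracted are all assumptions made up for this example.

```python
# Invented transcription of a single record page (not a real document).
page_text = (
    "Samuel Carter of Dover sold land to Thomas Reed. "
    "Witnesses: Abigail Hale and Josiah Pike."
)

# Hypothetical index: this imagined project extracted only the seller
# and the buyer, so the witnesses were never entered.
index = {
    "Samuel Carter": [1],
    "Thomas Reed": [1],
}

def in_index(name):
    """Look the name up in the selective index only."""
    return name in index

def in_full_text(name):
    """Search every word of the transcribed page, ignoring case."""
    return name.lower() in page_text.lower()

for name in ["Samuel Carter", "Abigail Hale"]:
    print(name, "| index:", in_index(name), "| full text:", in_full_text(name))
```

In this sketch the witness is on the page but not in the index, so an index-only search would miss her entirely.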

The main issue with indexing is the need to always check the original documents. Just because the name you are searching for does not appear in the indexed search does not mean it is absent from the original document. Relying solely on indexes as a finding aid is unwise and may be misleading.

This brings us to a difficult situation. Many genealogists rely entirely upon indexes. In fact, the response I frequently see from beginning researchers, when it is suggested that they may need to examine entire documents page by page, indicates that this is not commonly done. The more serious issue, however, is that only a vanishingly small percentage of the documents available throughout the world have been indexed. Now, when I make such a statement, I need to justify my reasoning.

It is apparent that vast quantities of books and documents have already been digitized. However, the manual, time-intensive indexing process has made no real dent in the huge numbers of documents available throughout the world. Many documents today are created digitally from the start. For example, this post is not going through a "paper" stage. This will be helpful for searches done in the future, but genealogists are concerned primarily with historical documents.

Because many types of institutions and entities are involved in indexing historical documents, obtaining an accurate number or percentage of the documents presently indexed is likely impossible. However, from discussions with those doing the indexing, my impression is that only a very small percentage of the documents have been indexed, even among those already digitized and available online. In addition, as I noted above, the lack of an effective way for computer programs to read handwriting without human intervention is one of the greatest obstacles.

