Analysis and extraction of information from scanned documents

At present, despite the numerous programmatic documents issued by government authorities and professional associations, there is no coherent digitization strategy for library document collections. The eLibrary Builder subproject aims to convert a collection of about 5 million pages into electronic form without affecting the originals. A smart search engine and indexing options will then be added to the digitized collection. In this way, the valuable original documents will be preserved while their content becomes instantly available to an unlimited number of users. The project has the following main objectives:

  • Creation of a single digital depot shared by the four Central University Libraries, which will become a genuine National Digital Educational Library;
  • Development of a document quality optimization system, especially for documents with particular spelling features;
  • Construction of efficient algorithms for recognizing page characteristics;
  • Establishment of good practice norms in the digitization field that bring together the technical protocols for document formats and selection criteria.

The innovation of the project lies in the four objectives listed above and in the scanning technology it uses. A fully automated scanning system with a capacity of over 2000 pages/hour will be purchased by the consortium leader; it will meet the processing requirements for both old and newer documents in different formats. The system will be equipped with the latest IT applications for recognizing text that is difficult to search.
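To put the stated figures in proportion, the following minimal sketch estimates how long scanning alone would take for about 5 million pages at 2000 pages/hour. The 8-hour working day is an assumption made here for illustration and does not come from the project description. Since the capacity is given as "over 2000 pages/hour", the result should be read as an upper bound on scanning time, not as a project schedule.

```python
import math


def scanning_hours(total_pages: int, pages_per_hour: int) -> float:
    """Hours of continuous scanning needed at a constant rate."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return total_pages / pages_per_hour


def working_days(hours: float, hours_per_day: float) -> int:
    """Whole working days needed, rounding a partial day up."""
    return math.ceil(hours / hours_per_day)


TOTAL_PAGES = 5_000_000   # from the text: about 5 million pages
PAGES_PER_HOUR = 2_000    # from the text: a lower bound ("over 2000 pages/hour")
HOURS_PER_DAY = 8         # assumption: one 8-hour shift per day

hours = scanning_hours(TOTAL_PAGES, PAGES_PER_HOUR)
days = working_days(hours, HOURS_PER_DAY)
print(f"{hours:.0f} scanner-hours, about {days} working days")
```

Under these assumptions the estimate is 2500 scanner-hours, or about 313 working days. It covers scanning only and leaves out document preparation, quality checks, and text recognition.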

The documents that will initially populate the digital depot will be selected from manuscripts, archived documents, multimedia document texts, serials, and books in the following subject categories:

  1. General information: Information Science; Bibliology; Library Science; Standardization; Civilization and Culture; Reference works: encyclopedias, dictionaries, biographies, bibliographies, biobibliographies; bibliographic research;
  2. Public Administration: Social Assistance, Military Sciences;
  3. Theology;
  4. Art;
  5. Legal Sciences;
  6. Economic Sciences;
  7. History: Archaeology, Archival Records;
  8. Philosophy; Psychology;
  9. Politics;
  10. Literature;
  11. Linguistics, Philology;
  12. Sociology: Demographics, Statistics;
  13. Ethnography: Folklore;
  14. Pedagogy;
  15. Natural Sciences: Geology, Geography, Biology;
  16. Exact Sciences: Mathematics, Physics, Chemistry;
  17. Applied Sciences: Technical Sciences, Engineering, Agronomy, Medicine, Pharmacology.