Information retrieval

Information retrieval methods. The object of any filing or storage system is to make available at a later date the information in the documents stored. Each collection will have its own special requirements and any system which meets those requirements will be an adequate retrieval system for that collection. If the number of documents in the system is very small, no formal retrieval system is required. It will be possible and adequate to scan all the documents. As further documents are added this soon becomes impossible and some means must be adopted for speeding up the search. The collection may now be divided into two parts, the documents themselves and an index to them. If a reference to the documents is placed in each appropriate place in the index, an information retrieval system has been created. The simplest system is a card index in which there is a separate card for each entry. The cards may be arranged either in a single or in a number of alphabetical sequences, which has the effect of scattering material on allied subjects, or in a classified order. The number of cards in such a system soon becomes large, and the labour of making the separate entries is great, so that either some desirable entries will not be made, so that information cannot be retrieved, or the index will become too cumbersome for easy use, and information in the collection will be ‘lost.’ Some easier and more effective method is needed.

If references to (or abstracts of) the documents are written on marginal punched cards (I), and these cards are notched to show all relevant headings, a search for information on a number of headings simultaneously becomes simplified, and the labour of maintaining the index is lessened. One position on the cards may be allotted to each possible subject heading and this is notched to show the presence of that characteristic in the document. The cards have only a limited number of holes, so that the number of headings can easily be greater than the available holes. There are several types of coding system which can be used to increase the capacity.

Subjects may be allotted numbers on a serial or classified basis, and these numbers be used in place of the subject heading on the cards either by allotting a separate set of holes to each digit or by use of a numerical code. Each type of heading may be punched into an appropriate field. Alternatively, the whole space on the card may be made into a single field and the codes for different subject headings be punched one over the other. This superposition of codes is most effective if the code combinations are chosen on a mathematically random basis. The number of false combinations which will be generated by interference between codes can be precisely calculated, and kept to an acceptably low proportion, by proper design of the system.
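The superposition of codes described above can be sketched in Python. This is only an illustration under stated assumptions: the card size `N_HOLES`, the code length `K_PER_CODE`, and the hash-seeded random choice of holes are all hypothetical and not taken from the source. The estimate at the end treats hole positions as independent, so it is an approximation and not the precise calculation the text refers to.

```python
from __future__ import annotations

import hashlib
import random

N_HOLES = 40     # hypothetical number of hole positions on one card
K_PER_CODE = 4   # hypothetical number of holes notched per subject heading


def code_for(heading: str) -> frozenset[int]:
    """Choose K_PER_CODE pseudo-random hole positions for a heading.

    The random generator is seeded from a hash of the heading, so the
    same heading always receives the same code.
    """
    digest = hashlib.sha256(heading.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(range(N_HOLES), K_PER_CODE))


def notch_card(headings: list[str]) -> frozenset[int]:
    """Superimpose the codes of all headings in a single field."""
    holes: set[int] = set()
    for heading in headings:
        holes |= code_for(heading)
    return frozenset(holes)


def card_drops(card: frozenset[int], heading: str) -> bool:
    """A card is selected when every hole of the heading's code is notched.

    A card that really carries the heading is always selected; a card
    that does not may still be selected by chance (a false combination).
    """
    return code_for(heading) <= card


def false_drop_estimate(n_headings: int) -> float:
    """Approximate chance that a card with n_headings superimposed codes
    is selected for a heading it does not carry."""
    p_notched = 1.0 - (1.0 - K_PER_CODE / N_HOLES) ** n_headings
    return p_notched ** K_PER_CODE
```

With these illustrative settings, `false_drop_estimate(3)` is smaller than `false_drop_estimate(10)`: the more headings are superimposed on one card, the more false combinations must be expected, which is why the field size and code length have to be chosen together.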

If the collection is large enough, machine sorted punched cards may be justifiable. Code numbers are punched into the appropriate fields in the body of the card, and searching can quickly be done by machine. Even with high speeds of search, the time taken to answer a question may be unacceptably high unless care is taken to subdivide the index file so that only a portion of it has to be searched for any query. If the load on the sorter is high enough to make the system economic, the waiting time may make the overall search time longer than for simpler and cheaper systems.

Landau, Thomas. 1967. Encyclopaedia of Librarianship. London: Bowes & Bowes. p. 222

Information retrieval. Finding documents or the information contained in documents in a library or other collection, selectively recalling recorded information. Methods of retrieval vary from a simple index or catalogue to the documents, to some kind of punched card or microfilm record which requires large or expensive equipment for mechanically selecting the material required. Classification, indexing and machine searching are all systems of information retrieval.

Montague, Leonard. 1971. The librarians’ glossary of terms used in librarianship and the book crafts and reference book. Great Britain: The Trinity Press. p. 329

Before turning to searching strategies it is necessary to summarize the fundamentals of information retrieval. The act of retrieving information depends on that information having been stored, so more properly, we should think in terms of information storage and retrieval. Both the storage and retrieval functions consist of three analogous operations.

A) Storage

1.- The subject analysis of the document by an indexer;

2.- the translation of the concepts analyzed into the indexing language of the system;

3.- the organization of the files of which the data-base is composed.

Documents must be analyzed so that their subject content can be accessed. Traditionally this has meant that indexers read each document, decide on its subject content and then isolate the concepts they wish to store.

1.- The ‘depth of indexing’ refers to the number of concepts isolated from each document, and this is a result of a management decision assessing the amount of intellectual effort available, so that the depth of indexing will vary from system to system.

2.- The concepts selected by the indexer are then translated into the indexing language employed by the system. Indexing languages are either controlled or natural. Those systems which employ natural language, of course, attempt to dispense with the intellectual effort of indexing and rely on the words used by the author of the paper in his choice of title, or by the abstractor, whether this be the author or an editorial abstractor who has summarized the paper. Natural language can be deficient as a medium for information retrieval because it is rich in synonyms, and the same thing, action, or concept may be described in the literature in different terminology. Various writers may use different words to describe the same concept and in the less scientific fields language will be used fancifully. There is substance in the generalization which divides terminology into two categories, hard or soft. ‘Hard’ terminology is almost always used by authors in the physical and chemical sciences: it is safe to say that physicists, mathematicians, and chemists generally choose clear and unambiguous titles for their papers, using terminology specifically defined in their individual disciplines by national and international standardizing bodies (eg British Standards Institution, American National Standards Institute, the International Union of Pure and Applied Chemistry, etc.). But in the social sciences and humanities the terminology used by authors is often ‘soft’: language used imaginatively. For example, a literature search on the psychological problems induced by drug use retrieved serious contributions to the literature with titles such as, ‘From pot to pot: a giant leap backwards’; ‘Grass: the modern tower of Babel’; and ‘Potted dreams’.

In the field of librarianship useful contributions on information retrieval have been published with the titles, ‘How golden is your retriever’, and ‘On the construction of white elephants’, and a book on a famous national library was entitled Out of the dinosaurs. Thus when using a natural language data-base we can never be sure that we have retrieved all the documents in the base covering a particular subject by using only a single term or single phrase, and it may well be necessary to input a string of synonyms in order to achieve high recall.

As a consequence of such linguistic idiosyncrasies most data-base producers employ a controlled language which facilitates concept indexing. This controlled language standardizes the terminology by limiting the choice of words available to the indexers as indexing terms. The standardized terms and phrases used to describe things, actions and concepts are listed in a thesaurus. Synonyms and alternative ways of naming the concepts are provided as references directing the indexer or the searcher to preferred terms. In the field of information retrieval, for instance, peek-a-boo cards, optical coincidence cards, aspect cards, feature cards, etc. are all phrases used to describe the same thing.

A thesaurus covering this area might choose FEATURE CARDS as the preferred term and make reference from the alternative forms of phrase, eg:

OPTICAL-COINCIDENCE CARD use FEATURE CARDS; PEEK-A-BOO CARDS use FEATURE CARDS.

The thesaurus will usually be more than simply a list of preferred terms with synonyms – the relationships which exist between the preferred terms will be displayed. People interacting with a system, the indexers adding references to the data-base, and the users searching it, will be shown: 1 broader terms (BTs) or generic terms (GTs): those above the preferred terms in the hierarchy; 2 more specific terms (STs) or narrower terms (NTs): those below the preferred terms in the hierarchy; 3 those terms which are horizontally rather than vertically related to the preferred terms: coordinate or related terms (RTs).
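A thesaurus of this kind can be sketched as a small Python data structure. The two "use" references are the ones given in the example above; the broader, narrower and related terms attached to FEATURE CARDS are invented purely for illustration and do not come from any real thesaurus, and the names `USE`, `TERMS`, `preferred_term` and `expand` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entry:
    """Relationships displayed for one preferred term."""
    broader: set[str] = field(default_factory=set)    # BTs / GTs
    narrower: set[str] = field(default_factory=set)   # NTs / STs
    related: set[str] = field(default_factory=set)    # RTs


# "use" references from non-preferred to preferred terms (from the text).
USE = {
    "OPTICAL-COINCIDENCE CARD": "FEATURE CARDS",
    "PEEK-A-BOO CARDS": "FEATURE CARDS",
}

# Preferred terms; every relationship listed here is a made-up example.
TERMS = {
    "FEATURE CARDS": Entry(
        broader={"PUNCHED CARDS"},
        narrower={"BODY-PUNCHED FEATURE CARDS"},
        related={"EDGE-NOTCHED CARDS"},
    ),
    "PUNCHED CARDS": Entry(narrower={"FEATURE CARDS"}),
}


def preferred_term(term: str) -> str:
    """Follow a 'use' reference, if there is one, to the preferred term."""
    key = term.strip().upper()
    return USE.get(key, key)


def expand(term: str) -> set[str]:
    """Preferred term plus its narrower terms, for a broader search."""
    key = preferred_term(term)
    entry = TERMS.get(key, Entry())
    return {key} | entry.narrower
```

Both the indexer and the searcher would call `preferred_term` before storing or matching a term, so that `preferred_term("peek-a-boo cards")` and `preferred_term("Feature Cards")` lead to the same heading, FEATURE CARDS.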

The examples of thesauri reproduced on the following pages are pages from the INSPEC thesaurus compiled by the Institution of Electrical Engineers to assist the indexers and searchers of the INSPEC data-bases to select the most appropriate terms for indexing and searching; the NLM thesaurus MeSH, Medical Subject Headings, used with the MEDLINE data-base, and the American Psychological Association’s Thesaurus of psychological index terms for use with Psychological abstracts.

3.- When the documents have been analyzed and the storage concepts identified in the language of the system they are entered into the data-base files. For on-line access all systems employ inverted files, ie files consisting of the most likely search parameters: indexing terms, keywords from titles and abstracts, authors’ names, etc. Each file is essentially a collection of sub-files as the accession numbers of the documents having a particular characteristic in common will be listed in the file against that characteristic.

For example, a file of keywords from the titles and abstracts of papers would give:

LITERACY

LITERAL

LITERALS

LITERARY

This organization allows us to create the sets of documents referred to earlier.
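An inverted file of this kind can be sketched in a few lines of Python. The accession numbers and titles below are invented for illustration; only the four keywords come from the example above, and the function name `build_inverted_file` is hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical accession numbers and titles, invented for this sketch.
DOCUMENTS = {
    101: "Adult literacy programmes",
    102: "Literal translation and literary style",
    103: "String literals in programming languages",
    104: "Literacy and literary culture",
}


def build_inverted_file(documents):
    """Map each keyword to the set of accession numbers containing it."""
    index = defaultdict(set)
    for accession, title in documents.items():
        for word in re.findall(r"[A-Za-z]+", title):
            index[word.upper()].add(accession)
    return dict(index)


inverted = build_inverted_file(DOCUMENTS)

# Each keyword heads a sub-file listing the documents that share it.
for keyword in ("LITERACY", "LITERAL", "LITERALS", "LITERARY"):
    print(keyword, sorted(inverted[keyword]))
```

Each entry in `inverted` is one of the sub-files mentioned above: the set of accession numbers that have a given keyword in common, ready to be combined with other sets during a search.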

B) Retrieval

The stages involved in the retrieval of information are analogous to the three storage functions. They are: 1) the analysis of the search question; 2) the translation of the question into the indexing language of the system; 3) the formulation of the search strategy: the search proper, ie the matching of terms in the search strategy against the terms in the data-base.

1.- As users of information retrieval systems do not always specify their information needs precisely, defining the scope of a search can be a tortuous process. It is usually necessary, therefore, for the librarian or information scientist undertaking the search to liaise with the end-user in a reference interview to elucidate and define the scope of the search, and to establish the basic parameters which are to be related in the search strategy.

2.- This accomplished, the scope of each parameter is translated into the language of the system by reference to a thesaurus or classification schedule. The natural language statement of the problem under search elicited in the reference interview is then defined in the language of the system. A search strategy can now be expressed as a Boolean logical statement using the logical operators AND, OR, NOT. If the system does not utilize a controlled language, the strategy is expressed in natural language with each parameter defined as specifically as possible.

3.- The search is essentially a matching process in which the terms in the search statement are compared with those that have been assigned to the citations by indexers, or with those that are present in the titles and the abstracts of the papers in the data-base.

George Boole (1815-1864) devised a system of symbolic logic in which he used three operators: +, x, -, to combine statements in symbolic form. His work was later developed by John Venn who expressed Boolean logical relationships diagrammatically, adopting Euler circles. Leonhard Euler was a Swiss mathematician who introduced the technique of expressing logical relationships graphically a century before Boole devised his logical operators. Boole's three operators are logical sum (+), logical product (x), and logical difference (-). All online information retrieval software packages are now designed to allow the searcher to specify his strategy by using these operators to link terms which have been selected to circumscribe the scope of the search.
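The three operators correspond to familiar set operations on the lists of accession numbers held in an inverted file: logical sum (OR) is union, logical product (AND) is intersection, and logical difference (NOT) is set difference. The Python sketch below assumes a hypothetical table of postings; the terms and accession numbers are invented for illustration.

```python
# Hypothetical postings: indexing term -> accession numbers of documents.
POSTINGS = {
    "DRUGS": {1, 2, 3, 5},
    "NARCOTICS": {2, 6},
    "PSYCHOLOGY": {2, 3, 4, 6},
    "ANIMALS": {3},
}


def posting(term):
    """Return a copy of the document set for a term (empty if unknown)."""
    return set(POSTINGS.get(term, set()))


def logical_sum(a, b):
    """OR: documents in either set."""
    return a | b


def logical_product(a, b):
    """AND: documents in both sets."""
    return a & b


def logical_difference(a, b):
    """NOT: documents in the first set but not in the second."""
    return a - b


# (DRUGS OR NARCOTICS) AND PSYCHOLOGY NOT ANIMALS
either_drug_term = logical_sum(posting("DRUGS"), posting("NARCOTICS"))
with_psychology = logical_product(either_drug_term, posting("PSYCHOLOGY"))
result = logical_difference(with_psychology, posting("ANIMALS"))
```

Here the OR step widens the set by adding a synonym, the AND step narrows it to documents indexed under both parameters, and the NOT step removes an unwanted aspect; with the made-up postings above, `result` is `{2, 6}`.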

Houghton, Bernard and Convey, John. 1977. On-line information retrieval systems: an introductory manual to principles and practice. United States: Clive Bingley & Linnet Books. pp. 19-26.

An information retrieval (IR) system comprises the people, activities and equipment concerned with the acquisition, organization and retrieval of information. Discussion of IR usually assumes that an IR system is computer-based. Whilst this is increasingly the case, IR systems can be manual, and the definition given here would include all manually searched library catalogues as well as bibliographies, indexes and abstracting publications. Nevertheless, IR more commonly means retrieval from a computer system, whether the information is held on a local system, increasingly in CD-ROM form, or on a remote system accessed by a telecommunications network. The requests put to IR systems are of one of two types: the search is either for a known item or for items in a particular subject.

In responding to requests, IR systems must achieve a balance between speed, accuracy, cost and retrieval effectiveness in revealing the existence of information items and displaying surrogates (representations) or the original items. The effectiveness of retrieval is measured by the pair of measures recall ratio and precision ratio. The recall ratio measures the proportion of those relevant documents in a database which are retrieved, whilst the precision ratio measures the proportion of the retrieved items which are relevant. There has been research since the late 1950s which has established that the two measures are inversely proportional. In general, steps taken to improve one measure of performance will have a deleterious effect on the other measure.
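The two ratios can be computed directly from sets of document identifiers. The Python sketch below follows the definitions just given; the relevant and retrieved sets are invented for illustration, and returning 0.0 for an empty denominator is a convention assumed here, not something stated in the source.

```python
def recall_ratio(retrieved, relevant):
    """Proportion of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


def precision_ratio(retrieved, relevant):
    """Proportion of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0  # assumed convention when nothing is retrieved
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical database in which documents 1-4 are relevant to a request.
relevant = {1, 2, 3, 4}

narrow_search = {1, 2}               # a specific strategy
broad_search = {1, 2, 3, 5, 6, 7}    # the same strategy with terms added

narrow = (recall_ratio(narrow_search, relevant),
          precision_ratio(narrow_search, relevant))
broad = (recall_ratio(broad_search, relevant),
         precision_ratio(broad_search, relevant))
```

In this made-up case the narrow search gives recall 0.5 with precision 1.0, while the broad search raises recall to 0.75 at the cost of precision falling to 0.5, which illustrates the trade-off between the two measures.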

Feather, John; Sturges, Paul. 1997. International encyclopedia of information and library science. Great Britain: Routledge. p. 211

Information Retrieval and the Concept of Information

The term information retrieval (IR) is possibly one of the most important terms in the field known as information science. A critical question is, thus, why, and in what sense, IR uses the term information. IR can be seen both as a field of study and as one among several research traditions concerned with information storage and retrieval. Although the field is much older, the tradition goes back to the early 1960s and the Cranfield experiments, which introduced measures of recall and precision. Those experiments rank among the most famous in IS and continue today in the TREC experiments (Text REtrieval Conference). This tradition has always been closely connected to document/text retrieval, as stated by van Rijsbergen (1979, p. 1):

Information retrieval is a wide, often loosely-defined term but in these pages I shall be concerned only with automatic information retrieval systems. Automatic as opposed to manual and information as opposed to data or fact. Unfortunately the word information can be very misleading. In the context of information retrieval (IR), information, in the technical meaning given in Shannon's theory of communication, is not readily measured (Shannon and Weaver). In fact, in many cases one can adequately describe the kind of retrieval by simply substituting 'document' for 'information'. Nevertheless, 'information retrieval' has become accepted as a description of the kind of work published by Cleverdon, Salton, Sparck Jones, Lancaster and others. A perfectly straightforward definition along these lines is given by Lancaster: 'Information retrieval is the term conventionally, though somewhat inaccurately, applied to the type of activity discussed in this volume. An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.' This specifically excludes Question-Answering systems as typified by Winograd and those described by Minsky. It also excludes data retrieval systems such as used by, say, the stock exchange for on-line quotations. (Notes to references omitted).

In 1996, van Rijsbergen and Lalmas, however, declared that the situation had changed and that the purpose of an information retrieval system was to provide information about a request. Although some researchers have fantasized about eliminating the concept of document/text and simply storing or retrieving the facts or "information" contained therein, it is our opinion that IR usually means document retrieval and not fact retrieval. We shall return to the difference between documents and facts later, but here we want to show why information (and not, for example, document, text, or literature) was chosen as a central term in this core area.

Ellis (1996, pp. 187-188) describes "an anomaly" in IS:
Brookes noted the anomaly could be resolved if information retrieval theory were named document retrieval theory which would then be part of library science. However, he commented that those working in the field of information retrieval were making the explicit claim to be working with information not documentation.

What Brookes (1981, p. 2) stated was,
From an information science point of view, research on IR systems offers only a theoretical cul-de-sac. It leads nowhere. The anomaly I have noted is this: the information-handling processes of the computers used for IR systems, their storage capacities, their input and internal information transmissions, are measured in terms of Shannon theory measures—in bits, megabits per second, and so forth. On the other hand, in the theories of information retrieval effectiveness information is measured in what I call physical measures—that is, the documents (or document surrogates) are counted as relevant or non-relevant and simple ratios of these numbers are used. The subsequent probabilistic calculations are made as though the documents were physical things (as, of course, they are in part), yet the whole enterprise is called information retrieval theory. So why, I ask, are logarithmic measures of information used in the theory of the machine and linear or physical measures of information in IR theory?

If information retrieval theory were called document retrieval theory, the anomaly would disappear. And document retrieval theory would fall into place as a component of library science, which is similarly concerned with documents. But that is too simple an idea. Those who work on IR theory explicitly claim to be working on information, not documentation. I therefore abandon the simple explanation of a misuse of terminology. I have to assume that IR theorists mean what they say: that they are contributing to information science. But are they? (emphasis in original).

Ellis and Brookes should not refer to the opinion of researchers in their attempts to solve this problem. Only arguments count. In our view, it is not too simple an idea to claim that information retrieval theory is in reality document retrieval theory and thus closely associated with library science. It is not difficult to disprove Brookes's statement that information retrieval does not deal with documents. A short examination of the literature demonstrates this, and even if the Cranfield experiments spoke about "information retrieval," their modern counterpart, the TREC experiments, speak about "text retrieval." "Text retrieval" and "document retrieval" are often used as synonyms for IR.

If one reads Brookes’s statement in the light of the relationship between the early documentalists and information scientists, it becomes clear that information scientists wanted to forge a distinctive identity to be both more information technology-oriented and more subject-knowledge oriented. One reason for information scientists to prefer not to be linked to library science might be that important technological improvements were carried out not by people associated with librarianship, but by those affiliated with computer science. This preference is most probably the reason they claimed to work with “information, not documentation.” Nevertheless Brookes's statement is flawed, and it has provoked endless speculation about the nature of information, which has not contributed to an understanding of the problems of IR. (Compare the quotation by Schrader, 1983, p. 99, cited earlier).

The worst thing may be that information scientists have overlooked some of the most important theoretical problems in their field. Van Rijsbergen (1986, p. 194) has pointed out that the concept of meaning has been overlooked in IS. The fundamental basis of all previous work – including his own – is in his opinion wrong because it has been based on the assumption that a formal notion of meaning is not required to solve IR problems. For us it is reasonable to suggest a link between the neglect of the concepts of text and documents on one hand and meaning (or semantics) on the other. Semantics, meaning, text, and documents are much more related to theories about language and literature, whereas information is much more related to theories about computation and control. We do not claim, however, that the statistical methods used in IR have not been efficient. We do claim, however, that semantics and pragmatics, among other things, are essential to better theoretical development in IR, and in the long run also to the improvement of operational systems.

Capurro, Rafael and Hjørland, Birger. 2003. The concept of information. https://www-capurro-de.translate.goog/infoconcept.html?_x_tr_sch=http&_x_tr_sl=en&_x_tr_tl=es&_x_tr_hl=es&_x_tr_pto=tc

The process by which previously stored information is accessed. A scientific discipline that studies the procedures and techniques for representing, ordering, searching, presenting and evaluating information in automated systems, with the aim of facilitating effective and efficient access to it. It is an interdisciplinary scientific field that forms part of computer science and includes the management of databases and, in general, of objects. Traditionally, it has dealt with storage structures, indexing methods, query languages, search strategies, data visualization and retrieval evaluation. At present, the field has been revolutionized by the incorporation of multimedia documents, new information retrieval methods (based on automatic classification, on the use of document markup languages, on the application of expert systems and relevance feedback procedures, and on hypermedia) and graphical visualization systems. “Recuperación de la información” is an apocope (shortening) of the broader term “tratamiento y recuperación de la información”, a free Spanish translation of the English term “information storage and retrieval”. It is sometimes used together with the term information processing to cover practically the whole field of logical computing (information retrieval and processing).

López, José. 2004. Diccionario enciclopédico de ciencias de la documentación. Spain: Editorial Síntesis. p. 371

Information Retrieval (IR) is not a new area; it has been developing since the late 1950s. Today, however, it plays a more important role because of the value that information has. It can be argued that having, or not having, the right information at the right time and in the right form can decide the success or failure of an operation. Hence the importance of Information Retrieval Systems (IRS), which can handle these situations effectively and efficiently, within certain limitations.

Tolosa, Gabriel; Bordignon, Fernando. 2007. Introducción a la recuperación de información: conceptos, modelos y algoritmos básicos. Argentina: Universidad Nacional de Luján. p. 9
