Open database of scientific publications ITMO UNIVERSITY

Journal

Scientific and technical journal «Priborostroenie»

Not specified

UDC

Issue: 12 (66)

Abstract

The state of the Internet as a repository of information resources is analyzed from the point of view of a bot, that is, a program that collects data to monitor resources, populate a search engine, or serve other commercial or research purposes. The problem is described through a set of phenomena that arise when documents are collected on the Internet; these phenomena must be taken into account when developing monitoring systems or search engines. A number of peculiarities that arise in web scraping, harvesting, and other uses of bots to collect data on the Internet are presented. The problems posed by subdomains, recursive subdomains, dynamically loaded content, search engine optimization of text content, and others are described. It is shown that collecting data from Internet resources is not only a technological task but, to a greater extent, a knowledge-intensive one, and that because research is still in an active phase, no "out-of-the-box" solution exists. The article will be useful to researchers studying the development of the Internet, search engine developers, specialists in data retrieval and Internet technologies, and specialists in the creation and support of Internet resources and in Internet marketing.

PHENOMENOLOGICAL DESCRIPTION OF INTERNET DOCUMENTS COLLECTING AND PROCESSING
