DOCUMENT REPRESENTATION FOR CLUSTERING OF SCIENTIFIC ABSTRACTS
Abstract
This paper addresses the clustering of narrow-domain short texts, such as scientific abstracts. The work builds on observations made while improving the performance of a key phrase extraction algorithm. An extended stop-word list, built automatically for key phrase extraction, considerably improved the quality of the phrases extracted from scientific publications. The procedure for creating the stop-word list is described. The main objective is to investigate whether this stop-word list, together with information about the parts of speech of lexemes, can improve the quality and/or speed of clustering. In the latter case, the vocabulary used for document representation contains not all the words that occur in the collection, but only the nouns and adjectives, or their sequences, encountered in the documents. Two baseline clustering algorithms are applied: k-means and hierarchical clustering (average-linkage agglomerative method). The results show that using the extended stop-word list together with the adjective-noun document representation improves both the quality and the speed of k-means clustering. Under the same conditions, the average-linkage agglomerative method may show a decline in quality. Using adjective-noun sequences for document representation lowers the clustering quality for both algorithms and is justified only when a considerable reduction of feature-space dimensionality is necessary.
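The effect of the two vocabulary restrictions on feature-space size can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy documents, the stop-word lists, and the hand-written part-of-speech lexicon are hypothetical and do not come from the paper, which builds its extended list automatically and would rely on a real tagger. Adjective-noun sequences are not covered here; only single nouns and adjectives are kept.

```python
import math
from collections import Counter

# Hypothetical toy data; none of these lists are taken from the paper.
BASIC_STOP_WORDS = {"the", "of", "a", "is", "for", "in", "we", "and", "are"}
EXTENDED_STOP_WORDS = BASIC_STOP_WORDS | {
    "paper", "method", "results", "proposed", "shown",
}
# Hypothetical part-of-speech lexicon standing in for a real POS tagger.
POS_LEXICON = {
    "clustering": "NOUN", "texts": "NOUN", "abstracts": "NOUN",
    "fiber": "NOUN", "losses": "NOUN",
    "short": "ADJ", "scientific": "ADJ", "optical": "ADJ", "low": "ADJ",
}

DOCS = [
    "the paper is devoted to clustering of short scientific texts",
    "results are shown for clustering of scientific abstracts",
    "the proposed method provides low optical losses in fiber",
]


def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic tokens only."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def build_vocabulary(docs, stop_words, allowed_pos=None):
    """Collect sorted vocabulary terms, dropping stop words and,
    if allowed_pos is given, every word whose tag is not in it."""
    vocab = set()
    for doc in docs:
        for tok in tokenize(doc):
            if tok in stop_words:
                continue
            if allowed_pos is not None and POS_LEXICON.get(tok) not in allowed_pos:
                continue
            vocab.add(tok)
    return sorted(vocab)


def vectorize(doc, vocab):
    """Term-frequency vector of one document over the given vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vocab_basic = build_vocabulary(DOCS, BASIC_STOP_WORDS)
vocab_extended = build_vocabulary(DOCS, EXTENDED_STOP_WORDS)
vocab_adj_noun = build_vocabulary(DOCS, EXTENDED_STOP_WORDS, {"NOUN", "ADJ"})

# Each restriction shrinks the feature space used for clustering.
print(len(vocab_basic), len(vocab_extended), len(vocab_adj_noun))

vectors = [vectorize(doc, vocab_adj_noun) for doc in DOCS]
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The resulting vectors could then be passed to any k-means or agglomerative clustering implementation. Whether the smaller vocabulary helps or hurts quality depends on the algorithm, as the abstract reports.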