Open Access System for Information Sharing

Article
Cited 2 times in Web of Science; cited 4 times in Scopus
Full metadata record
DC Field | Value | Language
dc.contributor.author | Shim, J | -
dc.contributor.author | Kim, D | -
dc.contributor.author | Cha, J | -
dc.contributor.author | Lee, GG | -
dc.contributor.author | Seo, J | -
dc.date.accessioned | 2016-03-31T13:06:43Z | -
dc.date.available | 2016-03-31T13:06:43Z | -
dc.date.created | 2009-03-18 | -
dc.date.issued | 2002-07 | -
dc.identifier.issn | 0306-4573 | -
dc.identifier.other | 2002-OAK-0000002620 | -
dc.identifier.uri | https://oasis.postech.ac.kr/handle/2014.oak/19095 | -
dc.description.abstract | Most work in NLP requires that texts have been previously segmented into sentences and words. Segmenting a text into sentences and words, however, is a complex task because many punctuation marks and spaces are ambiguous. Furthermore, Web texts such as HTML documents are harder to turn into well-refined, segmented texts because they are written in a freer style, with many sentence boundary and spacing errors. This paper introduces a multi-strategic integrated text preprocessing method for the difficult problems of sentence boundary disambiguation and word boundary disambiguation in Web texts. We have applied a hybrid method (regular expression rules, heuristic rules, and inductive learning of statistical decision trees using a C4.5 learner) synergistically to the task of raw corpus preprocessing. This work contributes to more accurate morphological analysis and more stable operation of application systems. We tackle easily definable problems with automatically acquired constraints, and we use inductively learned decision trees to solve ill-defined ambiguity problems by incorporating multiple features (n-grams, relative frequency, entropy, tri-dictionary index). The multistrategy approach was thoroughly tested on Korean news script Web documents: it achieved approximately 99.12% (with punctuation marks) and 98.04% (without any punctuation marks) accuracy in sentence boundary disambiguation, 95.39% accuracy in word spacing correction, and 94.61% accuracy for whole intermixed text preprocessing problems. (C) 2002 Elsevier Science Ltd. All rights reserved. | -
dc.description.statementofresponsibility | X | -
dc.language | English | -
dc.publisher | PERGAMON-ELSEVIER SCIENCE LTD | -
dc.relation.isPartOf | INFORMATION PROCESSING & MANAGEMENT | -
dc.subject | text normalization | -
dc.subject | sentence boundary disambiguation | -
dc.subject | word boundary disambiguation | -
dc.subject | spacing-word correction | -
dc.title | Integrated multi-strategic Web document pre-processing for sentence and word boundary detection | -
dc.type | Article | -
dc.contributor.college | Department of Computer Engineering | -
dc.identifier.doi | 10.1016/S0306-4573(01)00044-9 | -
dc.author.google | Shim, J | -
dc.author.google | Kim, D | -
dc.author.google | Cha, J | -
dc.author.google | Lee, GG | -
dc.author.google | Seo, J | -
dc.relation.volume | 38 | -
dc.relation.issue | 4 | -
dc.relation.startpage | 509 | -
dc.relation.lastpage | 527 | -
dc.contributor.id | 10103841 | -
dc.relation.journal | INFORMATION PROCESSING & MANAGEMENT | -
dc.relation.index | SCI-level, SCOPUS-indexed paper | -
dc.relation.sci | SCIE | -
dc.collections.name | Journal Papers | -
dc.type.rims | ART | -
dc.identifier.bibliographicCitation | INFORMATION PROCESSING & MANAGEMENT, v.38, no.4, pp.509 - 527 | -
dc.identifier.wosid | 000175479100004 | -
dc.date.tcdate | 2019-01-01 | -
dc.citation.endPage | 527 | -
dc.citation.number | 4 | -
dc.citation.startPage | 509 | -
dc.citation.title | INFORMATION PROCESSING & MANAGEMENT | -
dc.citation.volume | 38 | -
dc.contributor.affiliatedAuthor | Lee, GG | -
dc.identifier.scopusid | 2-s2.0-0036643016 | -
dc.description.journalClass | 1 | -
dc.description.journalClass | 1 | -
dc.description.wostc | 2 | -
dc.type.docType | Article | -
dc.subject.keywordAuthor | text normalization | -
dc.subject.keywordAuthor | sentence boundary disambiguation | -
dc.subject.keywordAuthor | word boundary disambiguation | -
dc.subject.keywordAuthor | spacing-word correction | -
dc.relation.journalWebOfScienceCategory | Computer Science, Information Systems | -
dc.relation.journalWebOfScienceCategory | Information Science & Library Science | -
dc.description.journalRegisteredClass | scie | -
dc.description.journalRegisteredClass | scopus | -
dc.relation.journalResearchArea | Computer Science | -
dc.relation.journalResearchArea | Information Science & Library Science | -
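
The abstract in the record above names regular expression rules, heuristic rules, and an entropy feature, but gives no implementation detail. The following is only a minimal sketch of how such pieces might look, not the authors' method: the abbreviation list, the candidate pattern, and the functions `split_sentences` and `successor_entropy` are all hypothetical names introduced for illustration, and the C4.5 decision tree stage is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation list (assumption); the paper's real rules are not in the abstract.
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "i.e."}

# Regular-expression rule: terminal punctuation followed by whitespace or end of text
# marks a *candidate* sentence boundary.
CANDIDATE = re.compile(r"[.!?]+(?=\s|$)")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations (heuristic rule)."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        # The last whitespace-delimited token before the boundary, including its punctuation.
        token = text[start:end].split()[-1].lower()
        if token in ABBREVIATIONS:
            continue  # ambiguous period: treat as part of the sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def successor_entropy(corpus, prefix):
    """Shannon entropy (in bits) of the character that follows `prefix` in `corpus`.

    Assumption: a high value suggests many possible continuations, which is one
    common way to signal a likely word boundary; the paper's exact entropy
    feature may be defined differently.
    """
    counts = Counter(
        corpus[i + len(prefix)]
        for i in range(len(corpus) - len(prefix))
        if corpus.startswith(prefix, i)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

In a full pipeline along the lines the abstract describes, values such as `successor_entropy` would be fed as features to a learned classifier rather than thresholded by hand.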

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
