lexiDB: A Scalable Corpus Database Management System

  • File type: Book
  • Language: English
  • Publisher: IEEE
  • Edition and year / country: 2018

Details

Related fields: Computer Engineering
Related specializations: Software
Journal: International Conference on Big Data


Published in: IEEE

Description

I. INTRODUCTION

Corpora utilised by corpus linguists have steadily grown in scale and complexity over the last fifty years. Beginning with relatively small corpora of one million words, such as Brown [5] in the 1960s (considered large at the time), the size of corpora has increased by an order of magnitude roughly every 10 years. In the 1990s the British National Corpus (BNC) was created with one hundred million words, and corpora of interest to linguists now number in the billions of words, with Historical Hansard and Early English Books Online (EEBO) being prime examples.

Alongside this growth in the size of the raw text, the number of levels of annotation attached to corpora has also increased. Beginning with simple part-of-speech (POS) tagging and lemmatisation, linguists now utilise more advanced annotation such as dependency parsing, semantic tags and historical spelling variants when conducting corpus analysis. This motivates the need for retrieval software and tools that are capable of supporting annotated corpus data at this scale and complexity.

In big data terms, increasing the ‘volume’ of corpora provides more examples of mid to low frequency words and linguistic features, which is important for analysis purposes. The ‘variety’ of data included within a corpus is also important for improving representativeness and coverage of the types of language being studied. In this paper, we address issues of ‘velocity’ (the application of parallel or distributed methods), which is vitally important because the current crop of corpus linguistics retrieval tools is struggling to cope with the ever increasing scale of corpora.

Typically, corpus linguists rely on five main retrieval methods to perform their analysis: concordances, collocations, clusters (n-grams), keyword lists and frequency lists.
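To make three of these query types concrete, the following is a minimal in-memory sketch of a frequency list, n-gram clusters and a concordance (keyword in context) over a toy token list. It is illustrative only and is not how lexiDB or any of the tools discussed here is implemented; the function names, the whitespace tokenizer and the sample sentence are all assumptions made for this example. Collocations and keyword lists are omitted because they additionally require an association measure and a reference corpus.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every contiguous cluster of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical toy corpus, chosen only for illustration.
tokens = tokenize("the cat sat on the mat and the cat slept")
print(frequency_list(tokens)[:2])
print(ngrams(tokens, 2).most_common(1))
print(concordance(tokens, "cat"))
```

Each function scans the whole token list, so the cost of every query grows with corpus size. At the billion-word scale described above, a linear scan of this kind is impractical without some form of indexing or parallelism.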
Whilst other, more complex forms of analysis exist, they are often built on top of one or more of these basic methods, or are subtle variations of such queries. These query types are generally not fully or efficiently supported by traditional DBMSs (Database Management Systems) or IR (Information Retrieval) systems, as shown in previous work [1]. Some systems have limited support for keyword in context search (concordances), but to support these query types fully, corpus linguists must usually rely on a tool built on top of an existing retrieval or database system.

Software that can be used locally on desktop PCs is sometimes favoured by linguists, as it gives them the flexibility to use their own corpora and to perform analysis with nothing more than a laptop. Tools such as WordSmith and AntConc allow users to perform corpus queries such as concordances and to generate frequency lists. However, these tools lack support for larger, billion-word-scale corpora.

Server-based tools also exist, such as Wmatrix [8], CQPweb [3], SketchEngine, KorAP [2] and corpus.byu. These tools are often based on Open Corpus Workbench (CWB), on existing relational DBMSs such as MySQL, or on text indexers such as Lucene. These systems handle larger corpora better but are less flexible than local tools, as linguists often cannot add their own corpora or annotation, or are restricted in the size of corpora that can be added.