Bijankhan Text Corpus

Dr. Mahmood Bijankhan

The Bijankhan corpus is maintained at the Linguistics Laboratory of the University of Tehran. It was collected from newspaper news items and common texts. One feature of the corpus is that every document in the collection has a subject title; for example, documents are classified under titles such as political, cultural and economic. The corpus has 4300 distinct titles, which provide a convenient test bed for clustering, categorization and similar tasks. The corpus contains 2,598,215 words and 550 tags, and was tagged manually. The document titles were ignored during tagging, because the goal was to obtain automatic tagging software.

Components of the tags in the Bijankhan corpus

Each tag in this collection follows a hierarchical structure. The parts at the beginning of a tag name describe the tag more generally: the main category comes first. The parts at the end of the tag name describe it in more detail, that is, they hold the other features of the main category. For example, the tag N_PL_LOC has three levels in the hierarchy. The first level, N, marks a noun; the second level, PL, marks the plural; and the third level, LOC, marks location.
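The hierarchy described above can be illustrated with a minimal Python sketch. It assumes only that the levels are separated by underscores, as in the N_PL_LOC example; the function names `parse_tag` and `coarsen` are hypothetical and are not part of the corpus tools.

```python
def parse_tag(tag):
    """Split a hierarchical tag such as 'N_PL_LOC' into its levels.

    Assumption: levels are separated by '_' (as in the N_PL_LOC example).
    """
    levels = tag.split("_")
    return {
        "main": levels[0],       # most general level, e.g. 'N' for noun
        "features": levels[1:],  # more specific levels, e.g. ['PL', 'LOC']
        "depth": len(levels),    # number of levels in the hierarchy
    }


def coarsen(tag, depth=1):
    """Keep only the first `depth` levels of a tag (a coarser description)."""
    return "_".join(tag.split("_")[:depth])


if __name__ == "__main__":
    print(parse_tag("N_PL_LOC"))
    print(coarsen("N_PL_LOC", depth=2))
```

Truncating tags to their leading levels in this way is one plausible route from a fine-grained tag set to a coarser one, since the leading levels carry the more general description.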

The features that can be listed for each main category are as follows:

Noun features: number (plural and singular), proper and common, definite and indefinite, generic noun, time, adverbial, season, title, month, place, group, direction, infinitive.

Verb features: person, tense, transitivity, active, passive, mood, aspect.

Adjective features: ordinal, comparative, simple, compound, superlative, past participle.

Adverb features: interrogative, regret and exclamation, time, repetition, wish, non-interrogative, comparative, quantitative, simple, compound, exemplification, negation, place.

Features of the minor categories: conjunction (nominal, pre-infinitive, general complementizer, comparison, relative, coordinating), interjection, vocative particle, quantifier, qualifier, pronoun (definite, indefinite, reflexive, object, singular, plural), mathematical symbol, Arabic, prepositional phrase, conditional particle. (All features of the main and minor categories, and all symbols that occur in the text of the Bijankhan corpus, are given in the appendix.)

The different tags that a word receives in the corpus reflect the different roles that words play in Persian. For example, a word that occurs 2568 times in the corpus and always has the same tag has a single role, whereas a far less frequent word may carry ten different tags, meaning that it has ten roles. The word آسمان ("sky"), for instance, receives the tag N_SING throughout the corpus (it is always a noun), while the word بالا ("up, above") receives different tags in different contexts.

Most words (91 percent) have only one tag, but some words have more than one tag, depending on where they occur in the text.
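The share of single-tag words can be computed with a short Python sketch like the one below. It assumes the corpus has already been read into (word, tag) pairs, because the file layout is not described here, and it counts distinct word forms, which is only one possible reading of the 91 percent figure. The function names and the two tags given to بالا in the toy data are illustrative placeholders, not values taken from the corpus.

```python
from collections import defaultdict


def tags_per_word(pairs):
    """Map each word to the set of distinct tags it received."""
    seen = defaultdict(set)
    for word, tag in pairs:
        seen[word].add(tag)
    return dict(seen)


def unambiguous_share(pairs):
    """Fraction of distinct words that always carry exactly one tag."""
    seen = tags_per_word(pairs)
    if not seen:
        return 0.0
    single = sum(1 for tags in seen.values() if len(tags) == 1)
    return single / len(seen)


if __name__ == "__main__":
    # Toy data: the tags for 'بالا' are made up for illustration only.
    toy_pairs = [
        ("آسمان", "N_SING"),
        ("آسمان", "N_SING"),
        ("بالا", "ADV_LOC"),
        ("بالا", "ADJ_SIM"),
    ]
    print(tags_per_word(toy_pairs))
    print(unambiguous_share(toy_pairs))
```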

Welcome to the website of the Bijankhan corpus

What is the Bijankhan Corpus?

The Bijankhan corpus is a tagged corpus suitable for natural language processing research on the Persian (Farsi) language. The collection was gathered from daily news and common texts. All documents in it are categorized by subject, such as political, cultural and so on; in total there are 4300 different subjects. The collection contains about 2.6 million manually tagged words, with a tag set of 40 Persian POS tags. It is prepared and distributed by the Database Research Group at the University of Tehran. We are indebted to Prof. M. Bijankhan of the Faculty of Literature & Human Science at the University of Tehran for his invaluable work on the original version of the corpus, and so we named the corpus after him.
We also recommend visiting the web site of the Hamshahri corpus, which is more suitable for information retrieval research.

Copyright

The Bijankhan corpus was created in the DBRG Lab at the University of Tehran, ECE department. All rights to this corpus and to the tools included in this package are reserved by the University of Tehran Database Research Group. The package may be used free of charge for research or other non-commercial purposes, provided that you cite the related papers below.

This Package’s components

Bijankhan processed corpus (149 MB)
Bijankhan original corpus (50.3 MB)
Distinct words of the Bijankhan corpus (76707 words in Unicode text format)
Five random training and test sets (85% training, 15% test) of the corpus, as used in the papers listed below.
Source code of the POS taggers that we used.
Published papers and presentations.
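Random 85%/15% training and test pairs of the kind listed above could be produced along the following lines. This is a minimal Python sketch and not the authors' procedure: the unit being shuffled (sentences, documents or tokens), the seeds and the function names are all assumptions.

```python
import random


def random_split(items, train_fraction=0.85, seed=0):
    """Shuffle a copy of `items` and cut it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def five_random_splits(items, train_fraction=0.85):
    """Build five independent (train, test) pairs, one per seed 0..4."""
    return [random_split(items, train_fraction, seed) for seed in range(5)]


if __name__ == "__main__":
    units = list(range(100))  # stand-in for sentences or documents
    for train, test in five_random_splits(units):
        print(len(train), len(test))
```

Fixing the seed for each pair keeps the splits reproducible, so results reported on one pair can be re-run on exactly the same data.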

Downloads

	Files	Description
1		*Processed corpus (11.1 MB):* This file is a compressed version of the whole corpus in Unicode text format. It contains a version of the Bijankhan corpus that has been processed to be more suitable for NLP tasks, as described in [1], and holds nearly 2.6 million tagged words. A sample of the corpus and a description of its tag set are also available for download.
2		*Original corpus (3.7 MB):* This file is a compressed version of the whole corpus in LBL text format. It contains the original, unmodified Bijankhan corpus, which was manually tagged and prepared at the Research Center of Intelligent Signal Processing (RCISP). Its tag set contains 550 tags, and the corpus has 4300 subject categories in total.
3		*The corpus distinct words (256 KB):* This compressed file is a Unicode text file that contains the 76707 distinct words of the Bijankhan corpus.
4		*Training and test sets (will be added soon):* This compressed file contains five different pairs of training and test sets that were created randomly from the Bijankhan corpus. Each training part comprises 85% of the corpus and each test part comprises 15%. For more information please refer to [1].
5		*MLE Tagger (53.4 KB):* This file contains the C# source code of the Maximum Likelihood Estimation (MLE) tagger that we implemented and used in our studies. It also contains a demo that shows how to use the program.
6		*TnT tagger:* To obtain a TnT tagger, please refer to the web site of TnT: Statistical Part-of-Speech Tagging.
7		*MBT Tagger:* An open source version of the Memory Based POS Tagger (MBT) can be found on its web site.
8		*Corpus Words (574 KB):* This file contains all words of the corpus and their frequencies.
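For readers unfamiliar with MLE tagging, the idea can be sketched in a few lines of Python. This is not the C# tagger from the package: it is a minimal illustration in which each known word receives the tag it carried most often in the training data. Giving unknown words the most frequent tag overall is an assumption made here, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_mle(tagged_sentences):
    """Learn the most frequent tag per word from (word, tag) sentences."""
    counts = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            overall[tag] += 1
    model = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    # Assumption: unknown words fall back to the most frequent tag overall.
    default = overall.most_common(1)[0][0] if overall else None
    return model, default


def tag_mle(words, model, default):
    """Tag each word with its most frequent training tag."""
    return [(word, model.get(word, default)) for word in words]


def accuracy(tagged_sentences, model, default):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sentence in tagged_sentences:
        for word, gold in sentence:
            total += 1
            correct += model.get(word, default) == gold
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy training data with placeholder words and tags.
    train = [[("a", "X"), ("b", "Y")], [("a", "X"), ("a", "Z")]]
    model, default = train_mle(train)
    print(tag_mle(["a", "b", "c"], model, default))
    print(accuracy([[("a", "X"), ("b", "Z")]], model, default))
```

Because each token is tagged independently of its context, a tagger of this kind serves mainly as a simple baseline for comparison with context-aware taggers such as TnT and MBT.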

Published Papers:

	Reference	PDF	Power Point	Description
[1]	Hadi Amiri, Hossein Hojjat, Farhad Oroumchian. Investigation on a Feasible Corpus for Persian POS Tagging. 12th International CSI Computer Conference, Iran, 2007.			This paper reports the creation of a test corpus for automatic part-of-speech tagging based on the Persian tagged corpus of Prof. Bijankhan. It covers preprocessing, statistical analysis and experiments on this corpus with a simple statistical POS tagging method, MLE.
[2]	Farhad Oroumchian, Samira Tasharofi, Hadi Amiri, Hossein Hojjat, Fahimeh Raja. Creating a Feasible Corpus for Persian POS Tagging. Technical Report, no. TR3/06, University of Wollongong in Dubai, 2006.			This technical report contains a very thorough analysis and account of the creation of the Bijankhan corpus.
[3]	Samira Tasharofi, Fahimeh Raja, Farhad Oroumchian, Masoud Rahgozar. Evaluation of Statistical Part of Speech Tagging of Persian Text. International Symposium on Signal Processing and its Applications, Sharjah (U.A.E.), 2007.			This paper studies the performance of one of the popular POS taggers, the TnT tagger, on the Bijankhan corpus. The TnT tagger has been shown to have high accuracy in English and some other languages; this paper shows that it provides high accuracy in Persian too.
[4]	Fahimeh Raja, Hadi Amiri, Samira Tasharofi, Hossein Hojjat, Farhad Oroumchian. Evaluation of part of speech tagging on Persian text. The Second Workshop on Computational Approaches to Arabic Script-based Languages, Linguistic Institute, Stanford University, 2007.			This paper compares the accuracy of three different POS taggers, MLE, MBT and TnT, on the Bijankhan corpus and demonstrates the value of simple heuristics and post-processing in improving the accuracy of these methods.
[5]	Abolfazl Aleahmad, Yoosef Ramezani, Farhad Oroumchian. Using OWA for Persian Part of Speech Tagging. November 2006.			In this study we used the OWA method to fuse the results of three different POS tagging systems, namely MLE (Maximum Likelihood Estimation), the TnT tagger and PTT (Persian Tree Tagger).
[6]	Hadi Amiri, Persian (Farsi) POS tagging, presented in the NLP course on 7 November 2006.
[7]	Mostafa Keikha, Persian (Farsi) POS tagging, presented in the NLP course on 7 November 2006.

Contact Information:

Please feel free to contact us if you have any questions:

	Name	Email	Subject
1	Hadi Amiri	h.amiri@ece.ut.ac.ir	The corpus, its statistics and POS taggers
3	Abolfazl AleAhmad	a.aleahmad@ece.ut.ac.ir	The corpus, its statistics and POS taggers
