University of Macau
CISC7021 - Applied Natural Language Processing
Assignment 1, 2023/2024
(Due date: 26 September 2023)

1. Introduction

In this assignment, we will prepare n-gram language models and evaluate their perplexity on the test set. We will learn how to create a language model using the language model toolkit SRILM¹ (Stolcke, 2002). The toolkit can be downloaded at: http://www.speech.sri.com/projects/srilm/download.html. Basic instructions on using the SRILM toolkit can also be found on the website.

Train and Test Data

The training and testing data for this assignment come from the News Commentary corpus, which was created for training English language models. The training data consists of 300 thousand lines of text, while the test set consists of around 90 thousand lines of text. The corpora are from the official website of the Shared Task: Machine Translation of News.² Both the training and testing data can be downloaded from UMMoodle.
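As a reminder of what the perplexity measured in this assignment means, the following is a minimal Python sketch, not part of SRILM and not the smoothing SRILM uses. It assumes a toy bigram model with add-one smoothing; the function names `train_bigram` and `perplexity` and the example sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their contexts over sentences padded with <s> and </s>."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])  # only predicted tokens; <s> is never predicted
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab

def perplexity(sentences, context_counts, bigram_counts, vocab):
    """Perplexity = exp(-average log-probability per predicted token)."""
    v = len(vocab) + 1  # assumption: one extra slot shared by all unseen words
    log_prob = 0.0
    n_tokens = 0
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            # Add-one (Laplace) smoothing keeps unseen bigrams from getting zero probability.
            p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Toy data, for illustration only.
train = ["the cat sat", "the dog sat"]
test = ["the bird sat"]
model = train_bigram(train)
ppl_train = perplexity(train, *model)
ppl_test = perplexity(test, *model)
print(f"train perplexity: {ppl_train:.2f}, test perplexity: {ppl_test:.2f}")
```

The test sentence contains a word never seen in training, so its perplexity comes out higher than the training perplexity. The same gap between training and test perplexity is what the tasks below ask you to measure with SRILM.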

2. Tasks

  1. Build word-based language models (1-gram, 2-gram, and 3-gram) for English text given the training data, and measure the perplexity on the training and test sets.
  2. Build character-based language models (1-gram to 6-gram) using the training data, and measure the perplexity on the training and test sets.
  3. Collect more monolingual data from the First Conference on Machine Translation (WMT16) and add them to the training data. Build language models and measure the perplexity.
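One possible way to approach tasks 1 and 2 is sketched below in Python, under stated assumptions. For character-based models, each line is rewritten as space-separated characters so that a word-oriented toolkit treats every character as a token; the `<sp>` placeholder for original spaces is an arbitrary choice, not something the assignment prescribes. The helper names `to_char_tokens` and `srilm_commands` and all file names are hypothetical. The command lines use the `-order`, `-text`, `-lm`, and `-ppl` options of SRILM's `ngram-count` and `ngram` tools; smoothing options are left at the toolkit defaults here and should be checked against the SRILM documentation for your installed version.

```python
def to_char_tokens(line, space_symbol="<sp>"):
    """Rewrite a sentence as space-separated characters, marking original spaces."""
    words = line.split()
    return f" {space_symbol} ".join(" ".join(word) for word in words)

def srilm_commands(train_path, test_path, lm_path, order):
    """Build (but do not run) SRILM command lines for training and evaluation."""
    train_cmd = ["ngram-count", "-order", str(order),
                 "-text", train_path, "-lm", lm_path]
    ppl_cmd = ["ngram", "-order", str(order),
               "-lm", lm_path, "-ppl", test_path]
    return train_cmd, ppl_cmd

# Character-level view of one sentence.
print(to_char_tokens("the cat sat"))

# Command lines for a word-based 3-gram model (hypothetical file names).
train_cmd, ppl_cmd = srilm_commands("train.txt", "test.txt", "word.3gram.lm", 3)
print(" ".join(train_cmd))
print(" ".join(ppl_cmd))

# If the SRILM binaries are on your PATH, the commands could be run with:
#   import subprocess
#   subprocess.run(train_cmd, check=True)
#   subprocess.run(ppl_cmd, check=True)
```

Building the commands as lists keeps the sketch runnable without SRILM installed and makes it easy to loop over the orders required in each task.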

3. Environment Setup

All the (development) tools required for the course assignments and projects are Linux/Unix programs. You need a Linux platform for conducting experiments and system implementation. Using a virtual machine (e.g. VM VirtualBox - https://www.virtualbox.org/) to host a Linux system (e.g. Ubuntu - http://www.ubuntu.com/) is a good choice, and we strongly recommend it. Besides, you will use different toolkits for various (pre)processing tasks in the coursework. For example, you need a g++ compiler for compiling the SRILM toolkit in this assignment. In any case, there are documents for using the toolkit. If you are new to processing text on the Linux platform, there is a very good introduction by Church (1994)³ to using Unix commands for basic text processing.

1 http://www.speech.sri.com/projects/srilm/download.html
2 http://www.statmt.org/wmt16/translation-task.html

4. Report

You need to submit a report of your work (2-3 pages). It should clearly present what you did in your experiments, how you did it, and how you solved the problems you encountered. You should include tables (or graphs) of the data (e.g. corpus statistics), the evaluated perplexities of your models, etc. I am particularly interested in the conclusions you draw about the models you built and the data you collected, as well as your analysis of the obtained results. The report should follow the two-column format of the ACL proceedings.
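The corpus statistics mentioned above could be gathered with a short script. The following Python sketch assumes whitespace tokenization and counts lines, tokens, and word types, plus the share of test tokens unseen in training. The function names `corpus_stats` and `oov_rate` and the toy sentences are hypothetical; which statistics to report is your own choice.

```python
from collections import Counter

def corpus_stats(lines):
    """Return line, token, and type counts plus the token frequency table."""
    counts = Counter()
    n_lines = 0
    for line in lines:
        n_lines += 1
        counts.update(line.split())  # whitespace tokenization (an assumption)
    stats = {
        "lines": n_lines,
        "tokens": sum(counts.values()),
        "types": len(counts),
    }
    return stats, counts

def oov_rate(test_lines, train_vocab):
    """Fraction of test tokens that never occur in the training vocabulary."""
    total = 0
    oov = 0
    for line in test_lines:
        for token in line.split():
            total += 1
            if token not in train_vocab:
                oov += 1
    return oov / total if total else 0.0

# Toy data, for illustration only. With real files you could pass an open
# file object instead, e.g. open("train.txt", encoding="utf-8").
train_lines = ["the cat sat", "the dog sat"]
test_lines = ["the bird sat"]
stats, train_counts = corpus_stats(train_lines)
print(stats)
print(f"OOV rate on test: {oov_rate(test_lines, train_counts):.2%}")
```

Reporting the out-of-vocabulary rate next to the perplexities can help explain differences between training and test results.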