CISC7021 Applied Natural Language Processing

University of Macau
CISC7021 - Applied Natural Language Processing
Assignment 1, 2023/2024
(Due date: 26 September 2023)

1. Introduction

In this assignment, we will prepare n-gram language models and evaluate their perplexity on a test set. We will learn how to create a language model using the language model toolkit SRILM¹ (Stolcke, 2002). The toolkit can be downloaded at: http://www.speech.sri.com/projects/srilm/download.html. Basic instructions on using the SRILM toolkit can also be found on the website.

Train and Test Data

The training and testing data for this assignment come from the News Commentary corpus, which was created for training English language models. The training data consists of 300 thousand lines of text, while the testing set consists of around 90 thousand lines of text. The data corpora are from the official website of the Shared Task: Machine Translation of News.² Both the training and testing data can be downloaded from UMMoodle.
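Before training, it can help to check that the downloaded files match the sizes described above. The following is a minimal Python sketch, not part of the assignment materials, that counts lines, whitespace-separated tokens, and distinct token types. The function name `corpus_stats` and the file name mentioned in the comment are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count lines, whitespace-separated tokens, and distinct token types."""
    n_lines = 0
    counts: Counter = Counter()
    for line in lines:
        n_lines += 1
        counts.update(line.split())
    return {
        "lines": n_lines,
        "tokens": sum(counts.values()),
        "types": len(counts),
    }


if __name__ == "__main__":
    # Tiny in-memory sample. For the real data, pass an open file instead,
    # e.g. open("train.txt", encoding="utf-8") (hypothetical file name).
    sample = ["the cat sat on the mat", "the dog sat"]
    print(corpus_stats(sample))
```

Because the function only iterates over lines, it works on an open file handle without loading the whole corpus into memory.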

2. Tasks

Build word-based language models (1-gram, 2-gram, and 3-gram) for English text from the training data, and measure the perplexity on the training and testing sets.
Build character-based language models, from 1-gram to 6-gram, using the training data, and measure the perplexity on the training and testing sets.
Collect more monolingual data from the First Conference on Machine Translation (WMT16) and add it to the training data. Build language models on the enlarged data and measure the perplexity.
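To make the quantities in these tasks concrete, here is a small Python sketch of a bigram model with add-one smoothing and its perplexity, for both word and character units. It is an illustration only, not a replacement for SRILM: SRILM applies its own discounting and backoff, so its perplexities will differ. The class `AddOneBigramLM`, the function `tokenize`, the `<unk>` handling, and the choice to show spaces as `_` in character mode are all assumptions made for this sketch.

```python
import math
from collections import Counter
from typing import List


def tokenize(line: str, unit: str = "word") -> List[str]:
    """Split a line into words, or into characters with spaces shown as '_'."""
    if unit == "word":
        return line.split()
    return ["_" if ch == " " else ch for ch in line.strip()]


class AddOneBigramLM:
    """Bigram model with add-one (Laplace) smoothing; illustration only."""

    def __init__(self, sentences: List[List[str]]):
        self.history: Counter = Counter()  # how often a token occurs as history
        self.bigrams: Counter = Counter()
        self.vocab = {"</s>", "<unk>"}     # tokens that may be predicted
        for sent in sentences:
            self.vocab.update(sent)
            padded = ["<s>"] + sent + ["</s>"]
            for prev, cur in zip(padded, padded[1:]):
                self.history[prev] += 1
                self.bigrams[(prev, cur)] += 1

    def prob(self, prev: str, cur: str) -> float:
        """P(cur | prev) with add-one smoothing over the vocabulary."""
        return (self.bigrams[(prev, cur)] + 1) / (self.history[prev] + len(self.vocab))

    def perplexity(self, sentences: List[List[str]]) -> float:
        """2 ** (negative average log2 probability per predicted token)."""
        log_prob, n = 0.0, 0
        for sent in sentences:
            # Map tokens unseen in training to <unk>.
            mapped = [t if t in self.vocab else "<unk>" for t in sent]
            padded = ["<s>"] + mapped + ["</s>"]
            for prev, cur in zip(padded, padded[1:]):
                log_prob += math.log2(self.prob(prev, cur))
                n += 1
        if n == 0:
            raise ValueError("no tokens to evaluate")
        return 2 ** (-log_prob / n)


if __name__ == "__main__":
    train_lines = ["the cat sat", "the dog sat"]
    test_lines = ["the cat ran"]
    for unit in ("word", "char"):
        lm = AddOneBigramLM([tokenize(l, unit) for l in train_lines])
        ppl = lm.perplexity([tokenize(l, unit) for l in test_lines])
        print(unit, round(ppl, 3))
```

Character mode shows one possible way to turn each line into a sequence of single-character tokens. Since SRILM reads whitespace-separated tokens, a similar conversion (characters separated by spaces, with a placeholder for the original spaces) is one option for preparing input for the character-based models.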

3. Environment Setup

All the related (development) tools for course assignments and projects are Linux/Unix programs. You need a Linux platform for conducting experiments and system implementation. Using a virtual machine (e.g. VM VirtualBox - https://www.virtualbox.org/) to host a Linux system (e.g. Ubuntu - http://www.ubuntu.com/) is a good choice, and we strongly recommend it. In addition, you will use different toolkits for various (pre)processing tasks in the coursework. For example, you need a g++ compiler for compiling the SRILM toolkit in this assignment. In any case, documentation is available for each toolkit. If you are new to processing text on the Linux platform, Church (1994)³ gives a very good introduction to using Unix commands for basic text processing.

1 http://www.speech.sri.com/projects/srilm/download.html
2 http://www.statmt.org/wmt16/translation-task.html

4. Report

You need to submit a report of your work (2-3 pages). It should clearly present what you did in your experiments, how you carried them out, and how you solved the problems you encountered. You should include tables (or graphs) of the data (e.g. corpora statistics), the evaluated perplexities, etc. of your models. I am particularly interested in the conclusions you draw about the models you built and the data you collected, as well as your analysis of the results. The report should follow the two-column format of the ACL proceedings.