Parse arxiv Metadata


Introduction

Goals of this Assignment

In this assignment, you will practise working with files, building and using dictionaries, designing functions using the Function Design Recipe, and writing unit tests.

Introduction: arxiv.org

arxiv.org (https://arxiv.org/) is a free distribution service and an open-access archive for nearly two million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. arXiv is pronounced "archive" (https://en.wikipedia.org/wiki/ArXiv).

The maintainers of arxiv.org (https://arxiv.org/) believe in open, free, and accessible information. In addition to free and easy access to the articles themselves, arxiv.org also provides ways to access its metadata (https://arxiv.org/help/bulk_data). This metadata includes information such as the article's unique identification number, author(s), title, abstract, the date the article was added to the arxiv and when it was last modified, the licence under which the article was published, etc. This metadata is used by a variety of research tools that investigate scientific research trends, provide intelligent scientific search techniques, and serve many other purposes.

To make this assignment more manageable for you, we have extracted a sample of arxiv's metadata, simplified it, and created a text file you will use as input to your program.

The Metadata File

The metadata file contains a series of one or more article descriptions, one after the other. Each article description has the following elements, in order:

A line containing a unique identifier of the article. An identifier will not occur more than once in the file, and it does not contain any whitespace.
A line containing the article's title. If we do not have title information, this line will be blank.
A line containing the date the article was created, or a blank line if this information is not provided. The date is formatted YYYY-MM-DD.
A line containing the date the article was last modified, or a blank line if this information is not provided. The date is formatted YYYY-MM-DD.
Zero or more lines with the article's author(s). Each line contains an author's last name, followed by a comma, followed by the author's first name(s). There is always exactly one comma on an author line. Note that an author's last name and/or first name may include whitespace and/or punctuation other than commas. Immediately after the zero or more author lines, the file contains a single blank line. If we do not have any author information for an article, then the blank line will come immediately after the modification date line.
Zero or more lines of text containing the abstract of the article. Immediately after the abstract, the file contains a line with the keyword END on it (and nothing else other than perhaps whitespace). You may assume that a line with only END in it does not occur in any other context in the metadata file, i.e. it always signifies that an article description is over. Unless this is the last line in the file, the next line will contain the identifier of the next article, and so on.

You can assume that any file we test your code with has this structure. You do not need to handle any invalid file formats.
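The structure above can be read with one loop per article. What follows is only a minimal sketch under stated assumptions, not the required solution: the function name parse_metadata is hypothetical, the key strings are the ones listed later in this handout, and every line is stripped of surrounding whitespace before it is stored.

```python
import io
from typing import Dict, TextIO

# Dictionary keys for one article (string values as listed in this handout).
ID, TITLE, CREATED = 'identifier', 'title', 'created'
MODIFIED, AUTHORS, ABSTRACT = 'modified', 'authors', 'abstract'


def parse_metadata(f: TextIO) -> Dict[str, dict]:
    """Return a dictionary mapping each article identifier in the open
    metadata file f to a dictionary describing that article.

    Hypothetical helper for illustration only.
    """
    id_to_article = {}
    line = f.readline()
    while line != '':  # readline returns '' only at the end of the file
        article = {ID: line.strip()}
        article[TITLE] = f.readline().strip()
        article[CREATED] = f.readline().strip()
        article[MODIFIED] = f.readline().strip()

        # Author lines continue until the first blank line.
        article[AUTHORS] = []
        line = f.readline().strip()
        while line != '':
            last_name, first_name = line.split(',')
            article[AUTHORS].append((last_name, first_name))
            line = f.readline().strip()

        # Abstract lines continue until the END marker.
        abstract_lines = []
        line = f.readline()
        while line != '' and line.strip() != 'END':
            abstract_lines.append(line.strip())
            line = f.readline()
        article[ABSTRACT] = '\n'.join(abstract_lines)

        id_to_article[article[ID]] = article
        line = f.readline()  # identifier of the next article, or ''
    return id_to_article


# io.StringIO stands in for an open file so the sketch runs on its own.
sample = io.StringIO(
    '031\n'
    'Calculus is the Best Course Ever\n'
    '\n'
    '2021-09-02\n'
    'Breuss,Nataliya\n'
    '\n'
    'We discuss the reasons why Calculus I\n'
    'is the best course.\n'
    'END\n'
)
articles = parse_metadata(sample)
print(articles['031'][AUTHORS])        # prints [('Breuss', 'Nataliya')]
print(articles['031'][CREATED] == '')  # prints True
```

Reading with readline rather than a for loop keeps the position in the file under explicit control, which makes it easy to stop at the blank line after the authors and at the END marker.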

Example Metadata File

Here is an example metadata file (also provided in the starter code later (a3.zip)):


008
Intro to CS is the Best Course Ever
2021-09-01

Ponce,Marcelo
Tafliovich,Anya Y.

We present clear evidence that Introduction to
Computer Science is the best course.
END
031
Calculus is the Best Course Ever

2021-09-02
Breuss,Nataliya

We discuss the reasons why Calculus I
is the best course.
END
067
Discrete Mathematics is the Best Course Ever
2021-09-02
2021-10-01
Pancer,Richard
Bretscher,Anna

We explain why Discrete Mathematics is the best course of all times.
END
827
University of Toronto is the Best University
2021-08-20
2021-10-02
Ponce,Marcelo
Bretscher,Anna
Tafliovich,Anya Y.

We show a formal proof that the University of
Toronto is the best university.
END
042

2021-05-04
2021-05-05

This is a very strange article with no title
and no authors.
END

This metadata file contains information on five articles with unique identifiers '008', '031', '067', '827', and '042'. Notice that the following information is not provided in the file: the modified date in article '008', the created date in article '031', and the title and authors in article '042'. All of these are valid cases and your code should deal with them. Also notice that an abstract can occupy zero, one, or more lines in the input file.

Storing the Arxiv Metadata

We will use a dictionary to maintain the arxiv metadata. Let us look in detail at the format of this dictionary. The types below are defined in constants.py and we have imported them into arxiv_functions.py for use in your type contracts.

Type NameType

We will store the names of authors as tuples of two strings: the author's last name(s) and the author's first name(s). For example, the author Anna Bretscher would be listed in the metadata file as 'Bretscher,Anna' and will be stored as ('Bretscher', 'Anna'). Note that an author's last name and/or first name may include punctuation and/or whitespace, and we need to keep all of this information. The only exception is that an author's first and last names never contain commas. For example, Tafliovich,Anya Y., Van Dyke,Mary-Ellen and Sklodowska Curie,Marie Salomea are all valid input lines, and should be stored as ('Tafliovich', 'Anya Y.'), ('Van Dyke', 'Mary-Ellen') and ('Sklodowska Curie', 'Marie Salomea'), respectively. A line like Tafliovich,Anya,Y. is not valid, since it contains two commas and we cannot tell which part is the first name and which is the last name. You will only have to deal with valid input in this assignment. We will also make the simplification that all authors have both first and last names.
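Because a valid author line contains exactly one comma, splitting on that comma yields the two parts of the tuple. The sketch below assumes NameType is a tuple of two strings; the helper name parse_author is hypothetical and not part of the starter code.

```python
from typing import Tuple

# Assumed definition for this sketch; constants.py holds the real one.
NameType = Tuple[str, str]


def parse_author(line: str) -> NameType:
    """Return the (last name, first name) tuple for the author line line.

    Hypothetical helper; assumes line contains exactly one comma.
    """
    # strip removes the trailing newline but keeps inner spaces and dots.
    last_name, first_name = line.strip().split(',')
    return (last_name, first_name)


print(parse_author('Tafliovich,Anya Y.\n'))   # prints ('Tafliovich', 'Anya Y.')
print(parse_author('Van Dyke,Mary-Ellen\n'))  # prints ('Van Dyke', 'Mary-Ellen')
```

Unpacking into two names would raise a ValueError on a line with two commas, which is acceptable here because invalid input does not need to be handled.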

Type ArticleType

The file constants.py in the starter code defines the following constants that you should use instead of the literal strings. Below are the current values of the constants.


ID = 'identifier'
TITLE = 'title'
CREATED = 'created'
MODIFIED = 'modified'
AUTHORS = 'authors'
ABSTRACT = 'abstract'

We will store information about a single article in a dictionary that maps ID, TITLE, CREATED, MODIFIED, AUTHORS, and ABSTRACT to the corresponding values. The value for each piece of information is of type str, except for the value associated with key AUTHORS, which is a list of NameType. If an element is not provided in the metadata file, then the value associated with that key will be empty (i.e. the empty string, or in the case of no authors, an empty list). The list of authors will be sorted in lexicographic order. (removed Nov 12)

For example, the article with the identifier '008' in our example input file above will be stored in the following dictionary:


{ID: '008',
 TITLE: 'Intro to CS is the Best Course Ever',
 CREATED: '2021-09-01',
 MODIFIED: '',
 AUTHORS: [('Ponce', 'Marcelo'), ('Tafliovich', 'Anya Y.')],
 ABSTRACT: 'We present clear evidence that Introduction to\nComputer Science is the best course.'}

Notice that since the fourth line in the specification is blank, the value corresponding to key MODIFIED is the empty string. Also notice that the final newline character on each line is not included in any of the stored values, except for the newline characters inside the abstract: we keep those! Take a careful look at the starter file example_data.txt (same as the example above) and the corresponding dictionary EXAMPLE_ARXIV defined in the file arxiv_functions.py for more examples.
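Since missing information is stored as an empty string or an empty list, an emptiness check works the same way for every key. The following is a sketch only: the type aliases are assumptions about what constants.py might contain, and missing_fields is a hypothetical function, not one of the required ones.

```python
from typing import Dict, List, Tuple, Union

ID, TITLE, CREATED = 'identifier', 'title', 'created'
MODIFIED, AUTHORS, ABSTRACT = 'modified', 'authors', 'abstract'

# Assumed type definitions for this sketch.
NameType = Tuple[str, str]
ArticleType = Dict[str, Union[str, List[NameType]]]

article_008: ArticleType = {
    ID: '008',
    TITLE: 'Intro to CS is the Best Course Ever',
    CREATED: '2021-09-01',
    MODIFIED: '',
    AUTHORS: [('Ponce', 'Marcelo'), ('Tafliovich', 'Anya Y.')],
    ABSTRACT: 'We present clear evidence that Introduction to\n'
              'Computer Science is the best course.',
}


def missing_fields(article: ArticleType) -> List[str]:
    """Return a sorted list of the keys in article whose values are empty.

    Hypothetical helper; both '' and [] count as empty.
    """
    return sorted(key for key, value in article.items() if not value)


print(missing_fields(article_008))           # prints ['modified']
print(article_008[ABSTRACT].count('\n'))     # prints 1
```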

Type ArxivType

Finally, we will store the entire arxiv metadata in a dictionary that maps article identifiers to articles, i.e. to values of type ArticleType. The key/value pair in this dictionary that corresponds to the above article is:


'008': {
    ID: '008',
    TITLE: 'Intro to CS is the Best Course Ever',
    CREATED: '2021-09-01',
    MODIFIED: '',
    AUTHORS: [('Ponce', 'Marcelo'), ('Tafliovich', 'Anya Y.')],
    ABSTRACT: 'We present clear evidence that Introduction to\nComputer Science is the best course.'
}
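Note that the identifier appears twice: once as the outer key and once inside the article under ID. A brief sketch of building and using such a dictionary follows; the function name index_by_id is hypothetical, and the articles carry only two keys to keep the example short.

```python
from typing import Dict, List

ID, TITLE = 'identifier', 'title'


def index_by_id(articles: List[dict]) -> Dict[str, dict]:
    """Return a dictionary mapping each article's identifier to the
    article itself.

    Hypothetical helper; assumes identifiers are unique.
    """
    return {article[ID]: article for article in articles}


# Abbreviated articles: a full ArticleType value has six keys.
articles = [
    {ID: '008', TITLE: 'Intro to CS is the Best Course Ever'},
    {ID: '031', TITLE: 'Calculus is the Best Course Ever'},
]
id_to_article = index_by_id(articles)

print(id_to_article['031'][TITLE])  # prints Calculus is the Best Course Ever
print('067' in id_to_article)       # prints False
```

Keying by identifier makes looking up one article a single dictionary access instead of a search through a list.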

The diagram below shows a picture of the dictionary that represents some of the articles in the example_data.txt file using the constants provided in constants.py.

Required Functions

In the starter code file arxiv_functions.py, follow the Function Design Recipe to complete the following functions. In addition, you must add some helper functions (i.e. functions that you design yourself) to aid with the implementation of these required functions. Helper functions also require complete docstrings. We strongly recommend you also use the suggested helper functions in the table below; we give you these hints to make your programming task easier.

Some indicators that you should consider writing a new helper function, or using something you've already written as a helper are:

Rewriting code to solve a task you have already solved in another function
Getting a warning from the checker that your function is too long
Getting a warning from the checker that your function has too many nested blocks or too many branches
Realizing that your function can be broken down into smaller sub-problems (with a helper function for each)

For each of the functions below, other than read_arxiv_file, write at least two examples that use the constant EXAMPLE_ARXIV. If your helper function takes an open file as an argument, you do NOT need to write any examples in that function's docstring. Otherwise, for any helper functions you add, write at least two examples in the docstring.
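As an illustration of a docstring with two examples, here is a sketch of a small function. The function count_authors and the constant SMALL_ARXIV are hypothetical stand-ins; in your own code the examples would use EXAMPLE_ARXIV from arxiv_functions.py.

```python
from typing import Dict

ID, AUTHORS = 'identifier', 'authors'

# Hypothetical, abbreviated stand-in for EXAMPLE_ARXIV.
SMALL_ARXIV = {
    '008': {ID: '008',
            AUTHORS: [('Ponce', 'Marcelo'), ('Tafliovich', 'Anya Y.')]},
    '042': {ID: '042', AUTHORS: []},
}


def count_authors(id_to_article: Dict[str, dict], article_id: str) -> int:
    """Return the number of authors of the article with identifier
    article_id in id_to_article.

    Precondition: article_id is a key in id_to_article.

    >>> count_authors(SMALL_ARXIV, '008')
    2
    >>> count_authors(SMALL_ARXIV, '042')
    0
    """
    return len(id_to_article[article_id][AUTHORS])
```

The two examples cover a typical case and an edge case (no authors), and they can be checked automatically with Python's doctest module.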

Your functions should not mutate their arguments, unless the description says that is what they do.

A note on sorting: Throughout the assignment, we ask for lists to be sorted in lexicographic order. This is the order that Python sorts in (such as when you call list.sort). You do not have to write your own sorting code (unless you want to!).
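For instance, Python compares tuples element by element, so a list of name tuples is ordered by last name first and by first name only when the last names are equal. The short sketch below uses names from the example file; the variable names are arbitrary.

```python
names = [('Tafliovich', 'Anya Y.'), ('Ponce', 'Marcelo'),
         ('Bretscher', 'Anna'), ('Pancer', 'Richard')]

# sorted returns a new list and leaves its argument unchanged.
in_order = sorted(names)
print(in_order[0])   # prints ('Bretscher', 'Anna')
print(names[0])      # prints ('Tafliovich', 'Anya Y.')

# list.sort sorts in place, so it mutates the list it is called on.
names.sort()
print(names == in_order)  # prints True
```

Using sorted rather than list.sort is one way to avoid mutating a list that was passed in as an argument.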

We have broken the components of the assignment down into five Tasks, grouping related functions together. Some tasks are easier than others, and you can do the tasks in any order. As in the previous assignments, we'll be marking each function mostly separately (however there will be some overlap when functions call other functions).
