文件索引工具的实现

原文地址：文件索引工具的实现

Introduction

实现HashTable，并且用HashTable完成一个文件索引工具。

HashTable

OBJECTIVES

Open and manipulate files
Implement a hashtable as part of your data structure library
Index files
Get familiar with command line parameters

SUMMARY

This week we're building a hashtable, then using the hashtable to index the files. After you complete your hashtable implementation, you'll write a piece of code that will traverse a file directory indexing the files. Then, you'll read each of those files and index each row of the file in another hashtable. Finally, you'll output the indexed data to a file.

HOW DOES THIS ALL FIT TOGETHER?

In the bigger scheme of things, we're putting pieces together to build a search engine. A search engine has a few components: a file crawler, a file processor, an indexer, and a query processor. The file crawler starts in one place, and traverses to find other files. On the web, this means starting at one page and following all the links on the page; in a file system, it means starting in one directory and traversing through the directory structure until all files have been foun. The file processor takes each of those files and finds significant words or concepts in the file. The indexer creates a data structure that (essentially) populates a data structure that makes it easy to find documents that have similar topics, or find documents that contain a certain word. The query processor lets us work with that data structure to get the results based on what we ask for.

PART 1: BUILD A HASHTABLE

First step is to build a hashtable. This is similar to the process we went through with LinkedList. You are once again provided with some starting files. See the a7 folder for the files.

Once again, you're provided with an example of how to use the hashtable, as well as example binaries that work so you can compare your output.

All of this should be in a htll folder in your a7 directory.

A few notes about hashtables:

Here's a set of slides explaining about hashtables.
Aspnes (Yale) also has some info about C and hashtables.
Hashtables need a key to map to a value. You get to choose what this value is. In this implementation, it must be an unsigned int. You'll see in the example hashtable usage, we compute a FNV hash value of a key, then use that to put a new value in the hashtable.

PART 2: FILE PROCESSING

Now, we're going to read in some movie info.

The program takes in a flag as an argument and outputs the results in a given order. The file will be in CSV format (comma-separated values). For this assignment, the format is such:
- represents an null value. The genre field is a comma-seperated list of genres: it should be parsed, and when indexing via genre, the movie should be considered as all genres.


id|type|title1|title2|_|year|_|duration|genre

A few rows:


tt7356766|tvEpisode|Walker Cup|Walker Cup|0|2017|-|-|Sport
tt7356768|tvSeries|Aaron Olivo's Skits|Aaron Olivo's Skits|0|2017|-|-|Comedy
tt7356770|short|Ducko|Ducko|0|2018|-|1|Comedy,Short
tt7356774|tvEpisode|Kitchen Disaster to Master|Kitchen Disaster to Master|0|2017|-|-|Comedy,Game-Show,Reality-TV
tt7356776|tvEpisode|Carolina Panthers at San Francisco 49ers|Carolina Panthers at San Francisco 49ers|0|2017|-|-|Sport
tt7358226|tvEpisode|Camra invisible|Camra invisible|0|1994|-|-|Adventure,Family,Thriller

The program accepts three different flags: -t, -g, -y.

The -t flag will index the movies based on type
The -g flag will index the movies based on genre
The -y flag will index the movies based on year
Only allow one flag at a time. It should be an error if more than one is provided.
If no flags are set, index by genre.
C has a tokenizer functionality. That might help process each line for you.

A sample invocation:


% ./index_movies -t data/test
Output to file_index.csv

The program is started for you, with the following structure:

Starting code is in the resources repo, a7 directory.

The README.md file has the steps you need to complete the code.

Some info on how to use command line arguments .

PART 3: OUTPUT THE INDEXED FILES TO AN "INDEX" FILE

After you've built the index, use the hashtable/index to generate a report of the indexed files.

Create a new file called indexed_movies.txt. Iterate through your hashtable, printing out each MovieSet in a format similar to below:


Genre: Crime
3 items
  The Shawshank Redemption
  The Usual Suspects
  American History X
Genre: Adventure
2 items
  The Lord of the Rings: The Fellowship of the Ring
  Back to the Future
...


Type: Short
3 items
  The Shawshank Redemption
  The Usual Suspects
  American History X
Type: tvEpisode
2 items
  The Lord of the Rings: The Fellowship of the Ring
  Back to the Future
Type: movie
5 items
...

Print out the indexed type/value on one line, then the number of items in that value, then the movie name for each movie in that genre.

PART 4: (OPTIONAL) SORTING THE OUTPUT

If you'd like to take the program one step further, you can output the movies in a sorted manner, utilizing C's qsort function and a custom-built comparator.

As a suggestion, consider creating a struct to store each row. Also, look at the qsort() function in C. Specifically, look at how you can pass a custom comparator function into it that will let you dictate how values are sorted.

SUBMISSION

Submit your assignment by pushing your code to Github. Tag it with the tag a7-final.

Tag your last commit, and submit the tag on Blackboard by the due date, midnight (Pacific Time).

Things to remember

Make sure your code compiles and runs; even if it crashes or isn't complete. Comment what you need to in order to make sure it compiles.
Unable to compile = losing 50% of your grade.
Commit early and often.
If you mistakenly tag your repo as a7-final and find yourself having to create a new final tag, delete the tag first. We don't want to figure out which of a7-final, a7-final-a, a7-final2, ... is the one we should be grading.
If there's no tag, automatic 0.
Valgrind should report that there are no memory leaks (all heap has been freed) as well as NO ERRORS.
If you have issues (you got sick, changed jobs, moved, ...) that impact your ability to finish on time, TELL US BEFORE THE DEADLINE. You will lose at least 50% of your points if you don't talk to us first

(本文出自csprojectedu.com，转载请注明出处)

文件索引工具的实现

Introduction

OBJECTIVES

SUMMARY

HOW DOES THIS ALL FIT TOGETHER?

PART 1: BUILD A HASHTABLE

PART 2: FILE PROCESSING

PART 3: OUTPUT THE INDEXED FILES TO AN "INDEX" FILE

PART 4: (OPTIONAL) SORTING THE OUTPUT

SUBMISSION

Things to remember

csprojectedu

引用和评论

3-Tier网络的开发

【深度剖析】MySQL B+树索引：一次查询的内部奥秘

深入理解 StarRocks Bitmap 索引和 Bitmap 去重

如何合理规划Elasticsearch的索引｜得物技术