Original post: Analyzing Stocks with the Pandas Library

Introduction

Use Pandas and other third-party libraries to analyze stocks, following the program of a Financial technology course.

Financial technology

Requirement

This assignment builds on Lectures 7 to 9 and on Tutorials 6 and 7. You might want to consider using some of the Python code discussed in those lectures and tutorials to answer some of the questions below.

Important: Do not change the type (markdown vs. code) of any cell, and do not copy, paste, or duplicate any cell! If the cell type is markdown, you are supposed to write text, not code, and vice versa. Provide your answer to each question in the allocated cell. Do not create additional cells. Answers provided in any other cell will not be marked. Do not rename the assignment files. All files should be left as is in the assignment directory.

Task

You are given two datasets:

  1. A file called Assignment4-data.csv that contains financial news (headlines) and daily returns for Apple (AAPL). Relying on this dataset, your role as a FinTech student is to explore the relationship between financial news and stock returns.
  2. A file called AAPL_returns.csv that contains the daily returns for Apple (AAPL).

Helpful commands

You may find the following commands helpful to complete some of the questions.

  1. How to create a new column using data from an existing column? Recall that, in Tutorial 7, we worked with a variable called FSscore. Suppose we wanted to divide all the values of this variable by 100 and store the outcome in a new column. This can be done in one step. The code df['FSscore_scaled'] = df['FSscore']/100 creates a new column with the name FSscore_scaled and stores the modified values.
  2. How to separate a string variable into a list of strings? The method split() splits a string into a list based on a specified separator. The default separator is any white space. However, one can specify the applied separator as an argument. For example, the code "a,b,c".split(",") splits the string "a,b,c" into the list ['a', 'b', 'c'].
  3. You can use string functions such as split() on a Pandas dataframe column by using the str attribute. For example, df['alphabets'].str.split(",") returns a series (consider a series as a dataframe with one column) in which each entry is the list obtained by running the split function on the corresponding entry in the column named alphabets.
  4. How to chain multiple string operations in Pandas? Note that a string function on a Pandas column returns a series. One can then use another string function on this series to chain multiple operations. For example, the cell below first converts the string to upper case and then calls the split function.
  5. How to combine two or more dataframes? For this purpose, one can use the concat function from Pandas. To combine the dataframes side by side so that rows with matching indices line up, use the axis=1 argument. Please see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html for examples.
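
The column-creation, split, and concat commands listed above can be sketched on toy data as follows. This is only an illustration: the two small dataframes, their index labels, and the label column are invented here and are not part of the assignment data.

```python
import pandas as pd

# Toy data, invented for illustration only
scores = pd.DataFrame({'FSscore': [50, 75, 100]}, index=['x', 'y', 'z'])
labels = pd.DataFrame({'label': ['low', 'mid', 'high']}, index=['x', 'y', 'z'])

# Create a new column from an existing one in a single step
scores['FSscore_scaled'] = scores['FSscore'] / 100

# A plain Python split returns a list of strings
parts = "a,b,c".split(",")

# axis=1 places the frames side by side, aligning rows on the index
combined = pd.concat([scores, labels], axis=1)
print(parts)
print(combined)
```

With the default axis=0, concat would instead stack the frames on top of each other, which is usually not what is wanted when the frames describe the same rows.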

Please run the following cell to import the required libraries and for string operations example.

In [1]:

## Execute this cell

####################### Package Setup ##########################

# Disable FutureWarning for better aesthetics.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# essential libraries for this assignment
from finml import *
import numpy as np
import pandas as pd
%matplotlib inline

# for logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# suppress warnings for deprecated methods from TensorFlow
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

################################################################

# Example of string operations
import pandas as pd
example_data = {'alphabets':['a,b,c', 'd,e,f', 'a,z,x', 'a,s,p']}
example_df = pd.DataFrame(example_data)

# Chain two string operations
example_df['alphabets'].str.upper().str.split(",")

Out[1]:


0 [A, B, C]
1 [D, E, F]
2 [A, Z, X]
3 [A, S, P]
Name: alphabets, dtype: object

Data exploration and transformation

The dataset has the following three columns:

  1. date: This column contains the date of the observation.
  2. headlines: This column contains the concatenation of headlines for that date. The headlines are separated by the <end> string. For example, if there are three headlines h1 , h2 , and h3 on a given day, the headline cell for that day will be the string h1<end>h2<end>h3.
  3. returns: This column contains the daily returns.
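
To make the column layout concrete, here is a minimal sketch that builds a two-row dataframe in the format described above and splits the headlines cell on the <end> separator. The headline placeholders and return values are invented; only the column names and the separator come from the description.

```python
import pandas as pd

# Toy rows in the documented format; the values are invented
toy = pd.DataFrame({
    'date': pd.to_datetime(['2008-01-07', '2008-01-08']),
    'headlines': ['h1<end>h2<end>h3', 'h4'],
    'returns': [0.01, -0.02],
})

# Split each cell on the separator to get one list of headlines per day
toy['headline_list'] = toy['headlines'].str.split('<end>')

# The length of each list is the number of headlines on that day
toy['n_headlines'] = toy['headline_list'].str.len()
print(toy[['date', 'n_headlines']])
```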

In your assessment, please address the following questions.

Question 1

Load the dataset into a Pandas dataframe and write Python code that plots the time series of the daily Apple returns (returns on the y-axis and dates on the x-axis). Make sure your plot's axes are appropriately labelled.

Note: Please use df as the variable name for the dataframe and the parse_dates argument to correctly parse the date column.

Answer 1

In [41]:

"""Write your code in this cell"""

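import pandas as pd

# Load the headline dataset into df; its columns are date, headlines, and returns
df = pd.read_csv('Assignment4-data.csv', parse_dates=['date'], encoding='ISO-8859-1')

# Plot daily returns against date and label both axes
ax = df.plot(x='date', y='returns', legend=False)
ax.set_xlabel("Date")
ax.set_ylabel("Daily Apple returns")
ax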

Out[41]:


<matplotlib.axes._subplots.AxesSubplot at 0x7f7cb1aed6a0>

Question 2

Write Python code that plots the time series of daily headline frequencies (the number of headlines per day on the y-axis and the corresponding date on the x-axis). Make sure your plot's axes are appropriately labelled.

Answer 2

In [*]:

"""Write your code in this cell"""

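import matplotlib.pyplot as plt

df = pd.read_csv('Assignment4-data.csv', parse_dates=['date'], encoding="ISO-8859-1")

# Headlines for a day are joined by '<end>', so the list length is the daily count
df['headline_count'] = df['headlines'].str.split('<end>').str.len()

ax = df.plot(x='date', y='headline_count', legend=False)
ax.set_xlabel("Date")
ax.set_ylabel("Number of headlines")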

Question 3

We will use neural networks to explore the relationship between the content of financial news and the direction of stock returns, i.e., their classification into positive or negative returns.

  1. Create a new column called returns_direction in the dataframe that classifies daily returns based on their direction: it assigns a given return a value of 1 if the return is positive (i.e., greater than 0), and a value of 0 otherwise. You may find the Numpy function where() useful for this question.
  2. Count the number of days on which the stock had positive and non-positive returns, respectively.
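
As a hedged illustration of how where() can drive this kind of classification, the sketch below applies it to a handful of invented return values rather than to the assignment data. Note that a return of exactly zero falls into the non-positive class.

```python
import numpy as np
import pandas as pd

# Invented return values, for illustration only
toy = pd.DataFrame({'returns': [0.012, -0.004, 0.0, 0.031]})

# 1 when the return is strictly positive, 0 otherwise
toy['returns_direction'] = np.where(toy['returns'] > 0, 1, 0)

# value_counts tallies how many days fall into each class
counts = toy['returns_direction'].value_counts()
print(counts)
```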

Answer 3

In [ ]:

"""Write your code in this cell"""
# YOUR CODE HERE
raise NotImplementedError()

Question 4

For this question please restrict your computations to the first 100 headline dates. You can select them by using the head function of Pandas . Calculate the tf-idf metric for the following word and headline(s) pairs:

  1. Word "apple" in headlines with date 2008-01-07. Store this value in a variable called aaple_tfidf .
  2. Word "samsung" in headlines with date 2008-01-17. Store this value in a variable called samsung_tfidf .
  3. Word "market" in headlines with date 2008-03-06. Store this value in a variable called market_tfidf .

Please write Python code that calculates the metrics from the df dataframe.
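
The following is a minimal sketch of one common tf-idf definition, not necessarily the one used in the lectures. It assumes that each date's concatenated headlines form one document, that tokens are obtained by lowercasing and splitting on white space, that tf is the word's share of the tokens in the document, and that idf is the natural log of the number of documents divided by the number of documents containing the word. The helper tfidf and the toy headlines are hypothetical.

```python
import math
import pandas as pd

def tfidf(word, doc_tokens, all_docs_tokens):
    """Hypothetical helper: tf * log(N / df) for one word in one document."""
    tf = doc_tokens.count(word) / len(doc_tokens)
    n_docs_with_word = sum(word in tokens for tokens in all_docs_tokens)
    if n_docs_with_word == 0:
        return 0.0  # the word appears in no document
    idf = math.log(len(all_docs_tokens) / n_docs_with_word)
    return tf * idf

# Invented headlines in the documented format
toy = pd.DataFrame({
    'date': pd.to_datetime(['2008-01-07', '2008-01-08', '2008-01-09']),
    'headlines': ['apple unveils laptop<end>apple shares rise',
                  'market falls',
                  'samsung ships phone'],
})

# One token list per date: drop the separator, lowercase, split on white space
docs = (toy['headlines'].str.replace('<end>', ' ', regex=False)
        .str.lower().str.split().tolist())

# Locate the row for the date of interest and score the word there
row = toy.index[toy['date'] == pd.Timestamp('2008-01-07')][0]
toy_apple_tfidf = tfidf('apple', docs[row], docs)
print(toy_apple_tfidf)
```

Here "apple" makes up 2 of the 6 tokens on 2008-01-07 and occurs in 1 of 3 documents, so the value is (2/6) * ln(3). Other conventions (raw counts for tf, smoothed or base-10 idf) give different numbers.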

Answer 4

In [ ]:

"""Write your code in this cell"""
# YOUR CODE HERE
raise NotImplementedError()

Question 5

Build and train a one-layer neural network with two units (neurons) to explain return directions based on financial news. Report and interpret the following three performance measures: "Precision", "Recall", and "Accuracy". According to your opinion, which performance measure(s) is (are) most important in the context of linking news headlines to stock returns and why?
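
To make the three performance measures concrete, the sketch below computes them by hand from invented true and predicted labels. The label arrays are assumptions for illustration, not model output; the precision_score and recall_score functions imported in the setup cell compute the same precision and recall for binary labels.

```python
import numpy as np

# Invented labels: 1 = positive return, 0 = non-positive return
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0])

# Cells of the confusion matrix
tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted up, was up
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted up, was not
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted not up, was up
tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted not up, was not

precision = tp / (tp + fp)          # share of predicted positives that were right
recall = tp / (tp + fn)             # share of actual positives that were found
accuracy = (tp + tn) / len(y_true)  # share of all predictions that were right
print(precision, recall, accuracy)
```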

Answer 5 - Code

In [ ]:

"""Write your code in this cell"""
# YOUR CODE HERE
raise NotImplementedError()

Answer 5 - Text

YOUR ANSWER HERE

Question 6

Explore different neural network models by changing the number of layers and units. You can use up to three layers and five units.

Complete the table below by adding your results for the test data set. You should duplicate the table format in your own markdown cell and replace the "-" placeholders with the corresponding values. Discuss your findings for both the test and train data sets.

Answer 6 - Code

In [ ]:

"""Write your code in this cell"""
# YOUR CODE HERE
raise NotImplementedError()

Answer 6 - Text

YOUR ANSWER HERE

Question 7

Explore the effects of different splits between the training and testing data on the performance of a given neural network model.

Complete the table below by adding your results. You should duplicate the table format in your own markdown cell and replace the "-" placeholders with the corresponding values. Discuss your findings.

Complete the table below by adding your results for the test data set. You should use the same markdown format and simply replace the "-" placeholders with the corresponding values. Discuss your findings for the different test and train data sets.
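
One way to vary the train/test split is sketched below using train_test_split from sklearn. This is an assumption about tooling: the course helpers may provide their own splitting routine, and the random features here are stand-ins rather than real headline data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features and labels, not real headline data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

split_sizes = {}
for test_size in (0.1, 0.2, 0.4):
    # random_state fixes the shuffle so each run gives the same split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)
    split_sizes[test_size] = (len(X_train), len(X_test))
    # A model would be fitted on the training part and scored on the test part here
print(split_sizes)
```

Keeping the model fixed and changing only test_size isolates the effect of the split on the reported measures.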

Answer 7 - Code

In [ ]:

"""Write your code in this cell"""
# YOUR CODE HERE
raise NotImplementedError()

Answer 7 - Text

YOUR ANSWER HERE

Question 8

Run a logistic regression with the same independent and dependent variables as used for the above neural network models. You have access to the sklearn package, which should help you answer this question. To work with the sklearn package, you may find the following links helpful.

  1. Building a logit model: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  2. Evaluating a logit model:

Compare and contrast your findings with the above findings based on neural network models.
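
A minimal sketch of fitting and evaluating a logit model with sklearn is shown below. The features and labels are synthetic placeholders generated here for illustration; in the assignment they would be the same text features and return directions used for the neural networks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic placeholder data, not the assignment features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Simple ordered split: first 150 rows for training, the rest for testing
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

logit = LogisticRegression()
logit.fit(X_train, y_train)
y_hat = logit.predict(X_test)

print("precision:", precision_score(y_test, y_hat))
print("recall:   ", recall_score(y_test, y_hat))
print("accuracy: ", accuracy_score(y_test, y_hat))
```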

Answer 8 - Code

In [ ]:

"""Write your code in this cell"""
# YOUR CODE HERE
raise NotImplementedError()

Answer 8 - Text

YOUR ANSWER HERE

Question 9

Everything you did so far explained stock returns with contemporaneous financial news, i.e., news released on the same date. To explore how well a neural network can predict the direction of future returns based on our text data, you should do the following.

  1. Please read the AAPL_returns.csv into a dataframe by using the parse_dates argument and create a new column returns_pred by shifting the returns by one trading day. For this purpose, you may find the shift function from Pandas helpful.
  2. Combine the df dataframe that contains headlines with this new dataframe such that for a given headline date, the value in returns_pred contains the return on the subsequent trading day.
  3. Train a neural network that uses financial news to learn the returns_pred variable. You are allowed to use any of the above neural network parameterisations and train/test data splits.
  4. Explain your findings with regard to the given data and your chosen parameters.

Interpret your results in the context of the Efficient Market Hypothesis (EMH).
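
The shift-and-combine steps can be sketched on toy data as follows. The sketch assumes that rows are sorted by ascending date with one row per trading day, so that shift(-1) pulls the next trading day's return onto the current row; the dates, returns, and headline placeholders are invented.

```python
import pandas as pd

# Invented returns, indexed by consecutive trading days in ascending order
returns = pd.DataFrame(
    {'returns': [0.01, -0.02, 0.03, 0.00]},
    index=pd.to_datetime(['2008-01-07', '2008-01-08', '2008-01-09', '2008-01-10']),
)

# shift(-1) moves values up one row: each date now holds the next day's return
returns['returns_pred'] = returns['returns'].shift(-1)

# Invented headlines for the first three dates
news = pd.DataFrame(
    {'headlines': ['h1<end>h2', 'h3', 'h4']},
    index=pd.to_datetime(['2008-01-07', '2008-01-08', '2008-01-09']),
)

# Align on the date index and keep only dates present in both frames
combined = pd.concat([news, returns['returns_pred']], axis=1, join='inner')
print(combined)
```

The last date in the returns data has no subsequent trading day, so its returns_pred is missing and would need to be dropped before training.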

Answer 9 - Code

In [ ]:

"""Write your code in this cell"""
# YOUR CODE HERE
raise NotImplementedError()

Answer 9 - Text

YOUR ANSWER HERE

(This article is from csprojectedu.com; please credit the source when reposting.)

