In this work, textual information in the Management Discussion section of 10-K reports is used to derive useful insights from a business standpoint. These reports are publicly available through the EDGAR platform on the SEC website. An NLP pipeline is constructed to clean and process the text for further analysis.
In Part A a corpus is constructed by scraping the index of filings from the SEC website, downloading the reports and extracting the relevant section (Item 7). Next, important keywords are examined which can provide value from either an analytical or a business point of view. To this end, TF-IDF analysis is performed on unigrams, bigrams and trigrams.
In Part B the goal is to link the sentiment of the text to different financial indicators, namely abnormal returns, abnormal volume and the volatility of abnormal returns. First, prices, fundamental indicators and holiday dates are fetched to enable the calculation of abnormal returns, volume and variance. In addition to the fundamental indicators, a number of technical indicators are calculated to build a more robust baseline model. Panel data analysis is then performed.
In Part C topic modelling is performed in an attempt to discover which topics are commonplace in 10-K reports. Topic modelling is a useful technique because it can summarize large amounts of text almost instantly. For example, on the date of a report filing an analyst could use this algorithm to summarize the report in a matter of seconds and link it to previous reports. Furthermore, topics can be linked to returns and other financial indicators.
library(edgar)
library(rvest)
library(tidyverse)
library(lubridate)
library(tidytext)
library(stringr)
library(magrittr) # for Tee pipe
library(httr) # scrape with headers
library(htmltidy) # clean broken html
library(tm.plugin.webmining) # remove html method 2
library(foreach)
library(doParallel)
library(lexicon) # stopwords
library(quantmod)
library(udpipe) # text annotation
library(lubridate) # date manipulation
library(bizdays) # business days manipulation
library(ggcorrplot) #plot correlation matrix
library(purrr) #map2
library(TTR) # TA indicators
library(splines) # for stm temporal
library(stm)
library(wordcloud)
library(SentimentAnalysis)
library(lmvar) # cross validation
library(magick)
library(cowplot)
library(svglite)
library(ggthemes)
library(kableExtra)
library(sjPlot)
We assume that the default action is not to buy stocks based on the sentiment of 10-K reports, and the corresponding null hypothesis is that stock movements are not related to the sentiment of 10-K reports.
portfolio <- read_csv("portfolio-data.csv")
names(portfolio) <- stringr::str_replace_all(names(portfolio), " ", "_")
portfolio <- portfolio %>% rename(cik = CIK)
portfolio %>%
mutate(GICS_Sub_Industry= as.factor(GICS_Sub_Industry)) %>%
group_by(GICS_Sub_Industry) %>%
count(sort = T) %>%
knitr::kable(caption = 'Portfolio option summary') %>%
kable_styling(position = "center")
GICS_Sub_Industry | n |
---|---|
Semiconductors | 13 |
Data Processing & Outsourced Services | 12 |
Application Software | 11 |
Technology Hardware, Storage & Peripherals | 7 |
IT Consulting & Other Services | 6 |
Communications Equipment | 5 |
Electronic Equipment & Instruments | 3 |
Internet Services & Infrastructure | 3 |
Semiconductor Equipment | 3 |
Systems Software | 3 |
Electronic Components | 2 |
Electronic Manufacturing Services | 2 |
Technology Distributors | 1 |
The top 3 industries have over 10 companies each, and the portfolio is built from companies in these industries. The main reason for this approach is sample maximization. Primarily, this study aims to uncover a link between price and text. Secondarily, the goal is to examine differences across industries. To achieve the latter goal, industries must be well represented in the sample. If differences in price dynamics are detected across the well-represented industries, further investigation can be extended to industries that are less well represented. To summarize, the main motivation behind this portfolio choice is to maximize the statistical power of the study and the generalisability of its results.
industries <- c("Semiconductors",
"Data Processing & Outsourced Services",
"Application Software")
candidate_ciks <- portfolio %>%
filter(GICS_Sub_Industry %in% industries) %>%
pull(cik)
portfolio <- portfolio %>%
filter(cik %in% candidate_ciks)
rm(candidate_ciks, industries)
During the first extraction attempt, the edgar package was used. For the top 10 companies within each category, 255 links were acquired from the master index through the edgar::getMasterIndex(2010:2020) call. In the best-case scenario, 70% of the records would be present. Furthermore, once the same package was used to extract the management discussion, a further 58 records were lost, leaving only 197 records (54% of the theoretical total) for further analysis. Even with sophisticated replacement strategies, such as extracting text from other sections or using 10-Qs instead of 10-Ks, this is inadequate. Therefore, custom scraping and section-extraction algorithms were developed.
The results from scraping and parsing the daily index from the SEC website are significantly better than those provided by the edgar package. Overall we get 10,991,106 records versus 7,942,110 records from the package, with roughly 20,000 more 10-K reports. Indeed, for the selected industries all reports are available. A number of companies have fewer than 11 reports, but research shows that this is the result of name changes and structural changes to the companies themselves rather than a fault in the scraping procedure. Hence these companies are omitted, as per the original strategy, leaving 10 companies for each selected industry with one report per year.
Another advantage of focusing on just a few sectors is that it reduces the variation in word use, which should lead to more consistent sentiment scores for firms in the portfolio. Words are likely to have different meanings across companies and sectors, which affects sentiment analysis negatively. By concentrating on a few industries this variation is minimized.
Utility function to help combine urls.
combURL <- function(base, addons, type="") {
for (addon in addons) {
base <- paste(base, addon, sep = "/")
}
return(paste0(base, type))
}
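For illustration, the call below (the path components are just an example) assembles one of the quarterly index URLs used later:
combURL("https://www.sec.gov/Archives/edgar/daily-index", c(2010, "QTR1", "index.json"))
# returns "https://www.sec.gov/Archives/edgar/daily-index/2010/QTR1/index.json"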
This operation cannot be parallelized or otherwise sped up, as we need to stay under the SEC limit of 10 calls per second. Additionally, Sys.sleep(0.1) is used to slow down the script. Since parsing might take a long time, index files are saved at this stage and parsed later in parallel.
domain <- "https://www.sec.gov/Archives/edgar/daily-index"
if(!dir.exists("master_idx")){
dir.create("master_idx")
}
for (year in 2010:2020) {
for (i in 1:4) {
qt <- paste0("QTR", i)
url <- combURL(domain, c(year, qt, "index.json" ))
Sys.sleep(0.1)
GET(url, user_agent("Mozilla/5.0"), write_disk("temp.json"))
file <- jsonlite::fromJSON("temp.json")
for (link in file$directory$item$name) {
if (str_detect(link, "master")) {
url <- combURL(domain, c(year, qt, link))
if(!dir.exists(combURL("master_idx", c(year)))) {
dir.create(combURL("master_idx", c(year)))
}
if(!dir.exists(combURL("master_idx", c(year, qt)))) {
dir.create(combURL("master_idx", c(year, qt)))
}
filename <- combURL("master_idx", c(year, qt, link))
GET(url, user_agent("Mozilla/5.0"), write_disk(filename))
Sys.sleep(0.1)
}
}
file.remove("temp.json")
}
}
cluster <- NULL
#utility function to register cluster.
register_cores <- function() {
n_cores <- parallel::detectCores() - 12
cluster <<- parallel::makeCluster(n_cores, type = "PSOCK")
doParallel::registerDoParallel(cl = cluster)
foreach::getDoParRegistered()
}
Parse idx files into a data frame for each year and quarter.
if(!dir.exists("master_indexes")){
dir.create("master_indexes")
}
#Setup for parallel computing
register_cores()
#Use "foreach" loop from the foreach package which supports parrallel operations.
for (year in 2010:2020) {
year_master <- foreach(
q = 1:4,
.combine=rbind,
.packages=c('tidyverse', 'stringr')
) %dopar% {
qt <- paste0("QTR", q)
url <- combURL("master_idx", c(year, qt))
files <- list.files(url)
q_master <- data.frame()
for (file in files) {
filename <- combURL("master_idx", c(year, qt, file))
file <- readLines(filename)
# split lines
file <- str_split(file, ' ')
# trim heading
file <- file[8:length(file)]
file_df <- data.frame()
for (i in 1:length(file)) {
# split into columns
l <- str_split(file[[i]][1], '\\|')
# convert to df
df <- data.frame(cik=l[[1]][1],
name=l[[1]][2],
form_type=l[[1]][3],
date=l[[1]][4],
link=l[[1]][5],
qtr = q)
file_df <- rbind(file_df, df)
}
q_master <- rbind(q_master, file_df)
}
return(q_master)
}
# save result
filename <- combURL("master_indexes", c(year), type = "_year_master.rda")
save(year_master, file = filename)
rm(year_master)
gc()
}
rm(q_master, df, file_df, l)
#close cluster
parallel::stopCluster(cl = cluster)
Combine all saved year master indexes into one data frame.
master_indexes <- list.files("master_indexes/",pattern="rda")
all_my_indexes <- data.frame()
for(master_index in master_indexes){
load(paste0("master_indexes/",master_index))
this_index <- year_master
all_my_indexes <- bind_rows(all_my_indexes,this_index)
print(master_index)
}
all_my_indexes <- all_my_indexes[-c(1:11),]
rm(this_index)
# update master index
all_my_indexes <- all_my_indexes %>%
filter(form_type == "10-K") %>%
filter(cik %in% portfolio$cik)
domain <- "https://www.sec.gov/Archives/"
if(!dir.exists("full_text")) {
dir.create("full_text")
}
for (i in 1:length(all_my_indexes$cik)) {
row = all_my_indexes[i,]
url <- paste0(domain, row$link)
print(url)
Sys.sleep(0.1)
dirname <- paste0("full_text/", row$cik)
dirname <- paste0(dirname, "/")
print(dirname)
if(!dir.exists(dirname)){
dir.create(dirname)
}
filename <- paste0(paste0(dirname,row$date),".txt")
print(filename)
if(!file.exists(filename)) {
print(filename)
GET(url, user_agent("Mozilla/5.0"), write_disk(filename))
}
}
rm(row, dirname, filename, url)
All 369 files downloaded successfully.
# clean document titles
# clean item tags
cleanDocTitle <- function(text) {
text <- str_replace(text, 'm ', 'm')
text <- str_replace(text, '> ', ' ')
text <- str_replace(text, '<[\\s\\S]*>', ' ')
text <- str_replace_all(text, '\n', ' ')
text <- str_replace_all(text, '"', ' ')
text <- str_replace_all(text, ' ', ' ')
text <- str_replace_all(text, ' ', ' ')
text <- str_replace_all(text, ' ', '')
text <- str_replace_all(text, '\\.', ' ')
text <- str_replace_all(text, '>', ' ')
text <- trimws(text)
text <- tolower(text)
return (text)
}
# remove html from text using rvest
strip_html <- function(text) {
if (!is.na(text)) {
if (text!= "") {
tryCatch( {
text <- html_text(read_html(text))
}, error=function(cond) {
text <- extractHTMLStrip(text)
})
}
}
return (text)
}
dirs <- list.dirs("full_text")
master <- data.frame()
getSections <- function(regex, text) {
start_end_section <- stringr::str_locate_all(text, regex)
start_end_section <- as.data.frame(start_end_section)
return(start_end_section)
}
for (dir in dirs[2:length(dirs)]) {
files <- list.files(dir)
for (file in files) {
date <- file
date <- str_remove(date, ".txt")
cik <- str_remove(dir, "full_text/")
file <- paste(dir, file, sep = "/" )
text <- read_file(file)
doc_start <- as.data.frame(stringr::str_locate_all(text, "<DOCUMENT>"))
doc_end <- as.data.frame(stringr::str_locate_all(text, "</DOCUMENT>"))
type <- as.data.frame(stringr::str_locate_all(text, '<TYPE>[^\n]+'))
for (i in 1:length(doc_start$start)) {
doc <- substr(text, type$start[i], type$end[i])
if (str_detect(doc, "10-K")) {
regex <- '(>| \\s|> |> )(Item|ITEM|Ite|It|"Item")(\\s| | |<a name="(1A|1B|7A|7|8|9|9A)">\\s|<.*?>m |\\s<.*?>)(1A|1B|7A|7|8|9|9A)\\.{0,1}'
start_end_section <- getSections(regex, text=text)
item7 <- NA
temp_df <- NA
tryCatch(
{
start_end_section$item <- cleanDocTitle(substring(text,
first=start_end_section$start,
last= start_end_section$end))
# select item 9 or 9a
is_item_9_detected = (start_end_section %>%
filter(item == "item9") %>%
count() %>% pull(n)) > 0
item9_lable <- ifelse(is_item_9_detected, "item9", "item9a")
# select top item 9 or 9a start index
top_item <- start_end_section %>%
filter(item == item9_lable) %>%
arrange(desc(start)) %>%
slice(1) %>%
pull(start)
if (!is.na(top_item)) {
# use top item 9 as upper bound
start_end_section <- start_end_section %>%
filter(start < top_item) %>%
filter(!(item %in% c("item9", "item9a")))
}
# top item from each item group
start_end_section <- start_end_section %>%
group_by(item) %>%
arrange(desc(start)) %>%
slice(1) %>%
ungroup()
# select item 8 or 7a
is_item_8_detected = (start_end_section %>%
filter(item == "item8") %>%
count() %>% pull(n)) == 1
end_lable <- ifelse(is_item_8_detected, "item8", "item7a")
row_index_item8 <- start_end_section %>%
filter(item == end_lable) %>%
pull(start)
# select item 7 or 7a
is_item_7_detected = (start_end_section %>%
filter(item == "item7") %>%
count() %>% pull(n)) == 1
start_lable <- ifelse(is_item_7_detected, "item7", "item7a")
row_index_item7 <- start_end_section %>%
filter(item == start_lable) %>%
pull(start)
# use item7a if item 7 is found after item 8. Preserves 3 reports.
if (row_index_item7 > row_index_item8) {
row_index_item7 <- start_end_section %>%
filter(item == "item7a") %>%
pull(start)
}
item7 <- substr(text, start = row_index_item7, stop = row_index_item8)
item7 <- strip_html(item7)
},
error=function(cond) {
print(cik)
print(date)
message("Error message:")
message(cond)
return(NA)
})
if (!is.na(item7)) {
if (item7 == "") {
print(start_end_section)
print(cik)
print(date)
}
}
temp_df <- data.frame(cik=cik, date=date, text=item7)
master <- rbind(master, temp_df)
}
}
}
}
rm(temp_df, file, item7, item9_lable, start_lable, end_lable, row_index_item7, row_index_item8, start_end_section)
One report failed to parse. An investigation shows that the report actually contains no management discussion; instead, it references another report.
master %>%
mutate(text_size = nchar(text)) %$%
hist(text_size, breaks = 20, xlim = c(0, 1000000))
The distribution of text size helps detect further errors. A small text size is a sign that problems may have occurred. From this sample, it seems that anything below 2,000 characters is unlikely to contain relevant text.
Both of these companies are eliminated from our portfolio.
# filter by available reports
portfolio_ciks <- master %>%
mutate(text_size = nchar(text)) %>%
filter(!(cik %in% c(97476, 50863))) %>%
filter(!(is.na(text) | text == "")) %>%
filter(text_size > 2000) %>%
unique() %>%
mutate(cik = as.numeric(cik)) %>%
inner_join(portfolio) %>%
count(GICS_Sub_Industry, cik) %>%
group_by(GICS_Sub_Industry) %>%
arrange(desc(n), .by_group = TRUE) %>%
filter(n > 7) %>%
pull(cik)
## Joining, by = c("cik", "Symbol", "Security", "GICS_Sector", "GICS_Sub_Industry")
# update portfolio
portfolio <- portfolio %>%
filter(cik %in% portfolio_ciks)
# update parsed text df
master <- master %>%
unique() %>%
filter(!(is.na(text) | text == "")) %>%
mutate(text_size = nchar(text)) %>%
filter(text_size > 2000) %>%
mutate(date = as.Date(date, format = "%Y%m%d")) %>%
mutate(cik = as.numeric(cik)) %>%
filter(cik %in% portfolio_ciks) %>%
inner_join(portfolio) %>%
mutate(doc_id = row_number())
## Joining, by = c("cik", "Symbol", "Security", "GICS_Sector", "GICS_Sub_Industry")
rm(portfolio_ciks)
#saveRDS(master, "master.rds")
portfolio$GICS_Sector <- NULL
knitr::kable(portfolio, caption = "Portfolio of Selected Companies")
Symbol | Security | GICS_Sub_Industry | cik |
---|---|---|---|
ADBE | Adobe Systems Inc | Application Software | 796343 |
AMD | Advanced Micro Devices Inc | Semiconductors | 2488 |
ADS | Alliance Data Systems | Data Processing & Outsourced Services | 1101215 |
ADI | Analog Devices, Inc. | Semiconductors | 6281 |
ANSS | ANSYS | Application Software | 1013462 |
ADSK | Autodesk Inc. | Application Software | 769397 |
BR | Broadridge Financial Solutions | Data Processing & Outsourced Services | 1383312 |
CDNS | Cadence Design Systems | Application Software | 813672 |
CTXS | Citrix Systems | Application Software | 877890 |
FIS | Fidelity National Information Services | Data Processing & Outsourced Services | 1136893 |
FISV | Fiserv Inc | Data Processing & Outsourced Services | 798354 |
FLT | FleetCor Technologies Inc | Data Processing & Outsourced Services | 1175454 |
GPN | Global Payments Inc. | Data Processing & Outsourced Services | 1123360 |
INTU | Intuit Inc. | Application Software | 896878 |
JKHY | Jack Henry & Associates | Data Processing & Outsourced Services | 779152 |
MA | Mastercard Inc. | Data Processing & Outsourced Services | 1141391 |
MXIM | Maxim Integrated Products Inc | Semiconductors | 743316 |
MCHP | Microchip Technology | Semiconductors | 827054 |
MU | Micron Technology | Semiconductors | 723125 |
NLOK | NortonLifeLock | Application Software | 849399 |
NVDA | Nvidia Corporation | Semiconductors | 1045810 |
ORCL | Oracle Corp. | Application Software | 1341439 |
PAYX | Paychex Inc. | Data Processing & Outsourced Services | 723531 |
QCOM | QUALCOMM Inc. | Semiconductors | 804328 |
CRM | Salesforce.com | Application Software | 1108524 |
SWKS | Skyworks Solutions | Semiconductors | 4127 |
SNPS | Synopsys Inc. | Application Software | 883241 |
V | Visa Inc. | Data Processing & Outsourced Services | 1403161 |
WU | Western Union Co | Data Processing & Outsourced Services | 1365135 |
XLNX | Xilinx | Semiconductors | 743988 |
HTML was already stripped during parsing; now further cleaning needs to be done to remove digits and symbols. At this stage two cleaning functions are defined: one removes punctuation completely, while the other attempts to remove only extra punctuation (mainly table leftovers) and preserve the sentence structure. Whilst punctuation is not necessary in most cases, the sentimentr package used in Part B relies on punctuation to identify inflection and analyses sentiment at the sentence level. However, due to the presence of tables in this dataset, removing punctuation accurately is challenging. Hence, the data is split into two formats.
clean_text_retain_punctuation <- function(text) {
#we trim the text to remove section title.
text <- stringr::str_sub(text,start = 94,end = -1)
#unescape unicode
text <- stringi::stri_unescape_unicode(text)
#sets all chars to unicode
text <- iconv(text, "ASCII", sub = " ")
#removes line breaks
text <- stringr::str_replace_all(text, "\r?\n|\r|\t", " ")
text <- stringr::str_replace_all(text, "[[digit:]]$", " ")
#removes digits
text <- tm::removeNumbers(text)
#remove $ and % sign
text <- stringr::str_replace_all(text, "\\$|\\$(\\.)", " ")
text <- stringr::str_replace_all(text, "%", " ")
text <- qdap::bracketX(text)
text <- str_squish(text)
#remove repeating characters
text <- stringr::str_replace_all(text, '([?@])\\1+', " ")
text <- stringr::str_replace_all(text, '\\,\\.|\\.\\,', " ")
return (text)
}
text_with_punctuation <- parallel::mclapply(master$text, clean_text_retain_punctuation)
text_with_punctuation <- unlist(text_with_punctuation)
saveRDS(text_with_punctuation, "text_with_punctuation.rds")
clean_text <- function(text) {
#we trim the text to remove section title.
text <- stringr::str_sub(text,start = 94,end = -1)
#unescape unicode
text <- stringi::stri_unescape_unicode(text)
#sets all chars to unicode
text <- iconv(text, "ASCII", sub = " ")
#removes line breaks
text <- stringr::str_replace_all(text, "\r?\n|\r|\t", " ")
#removes digits
text <- tm::removeNumbers(text)
#remove non word breaking punctuation
text <- tm::removePunctuation(text,
preserve_intra_word_contractions = T,
preserve_intra_word_dashes = T)
return (str_squish(text))
}
cleaned <- parallel::mclapply(master$text, clean_text )
master$text <- unlist(cleaned)
rm(cleaned)
saveRDS(master, "master_cleaned.rds")
Next, Part-of-Speech (POS) tagging is conducted. Importantly, this is done before stopword removal, as common stopwords provide important grammatical information which helps the tagger distinguish between nouns and verbs. In other words, POS tagging looks at the sequence as a whole.
langmodel <- udpipe::udpipe_download_model("english")
langmodel <- udpipe::udpipe_load_model(langmodel$file_model)
postagged_text <- udpipe_annotate(langmodel,
master$text,
parallel.cores = 15,
trace = T)
postagged_text <- as.data.frame(postagged_text)
saveRDS(postagged_text, "postagged_text.rds")
postagged_text <- readRDS("postagged_text.rds")
master <- readRDS("master_cleaned.rds")
Stopwords are words that carry mainly grammatical meaning and provide little sentiment value on their own in a typical bag-of-words model. Therefore, they are removed to reduce noise in the dataset and speed up computation.
In addition to standard stopword dictionaries such as the SMART lexicon, the NLTK list and Fry's top 100 list, finance-specific dictionaries are used. Loughran and McDonald (LM) provide custom stopword lists for financial text at https://sraf.nd.edu/textual-analysis/resources/#StopWords
These are used to filter out the names of auditors, names of management, references to geographic locations, as well as dates, numbers and currencies.
#load downloaded dictionaries
SW_Auditor <- data.frame(word = readLines("stopwords/StopWords_Auditor.txt"))
SW_Currencies <- read_delim("stopwords/StopWords_Currencies.txt", delim = "|")[1]
names(SW_Currencies) <- c("word")
SW_DatesNumbers <- data.frame(word = readLines("stopwords/StopWords_DatesandNumbers.txt"))
SW_Geographic <- data.frame(word = readLines("stopwords/StopWords_Geographic.txt"))
SW_Names <- data.frame(word = readLines("stopwords/StopWords_Names.txt"))
StopWords_LM <- rbind(SW_Auditor,SW_Currencies, SW_Names, SW_DatesNumbers, SW_Geographic)
#after POS we will take the lemma meaning words will be in lower case
StopWords_LM <- StopWords_LM %>% mutate(word = tolower(str_replace_all(word,"[^[:graph:]]", " ")))
rm(SW_Auditor, SW_Currencies, SW_Names, SW_DatesNumbers, SW_Geographic)
#remove words unrelated to content
my_stopwords <- data.frame(word = c("Table", "of", "Contents", "table", "contents"))
#remove company names
company_names <- portfolio %>% unnest_tokens(word, Security) %>% select(word)
my_stopwords <- rbind(my_stopwords, company_names)
rm(company_names)
#Load standard stopword dictionaries
stopwords_nltk<- as.data.frame(stopwords::data_stopwords_nltk$en)
data(sw_fry_100)
stopwords_fry <- as.data.frame(sw_fry_100)
names(stopwords_fry) <- c("word")
names(my_stopwords) <- c("word")
names(stopwords_nltk) <- c("word")
Rather than filtering by tf-idf, a cautious approach is taken: only words which appear more than five times are kept. This deals with the vast majority of parsing mistakes. Method-specific tf-idf trimming is applied as needed later on.
# document term frequency filter
document_term_freq_filter <- function(tokens) {
tokens <- tokens %>%
count(word) %>%
filter(n > 5) %>%
inner_join(tokens)
return(tokens)
}
#we use mistake detection to remove parsing errors
#we don't expect actual mistakes in 10-K reports
#mistake detection is done after lemmatization and POS filtering
#this saves computing time as we have fewer tokens
#hunspell wrapper -- checks against US English first, then British English
hunspell_double_english <- function(word) {
word <- unlist(hunspell::hunspell(word, dict = 'en_US'))
if (is.character(word)) {
word <- unlist(hunspell::hunspell(word, dict = 'en_GB'))
}
return (word)
}
# function to monitor token count in pipe
count_tokens <- function(tokens, name = "") {
tokens %>%
count(word) %>%
summarise(total = sum(n)) %$%
print(total)
print(name)
return (tokens)
}
tokens <- postagged_text %>%
filter(upos %in% c("NOUN","ADJ","ADV")) %>%
select(lemma, doc_id) %>%
rename(word = lemma) %>%
mutate(word = tolower(word)) %>%
count_tokens("Stage: Initial") %>%
anti_join(stop_words, by = "word") %>%
count_tokens("Stage 1: after SMART") %>%
anti_join(stopwords_nltk, by = "word") %>%
count_tokens("Stage 2: after NLTK") %>%
anti_join(my_stopwords, by = "word") %>%
count_tokens("Stage 3: after My Stopwords") %>%
anti_join(StopWords_LM, by = "word") %>%
count_tokens("Stage 4: after LM Stopwords") %>%
mutate(token_length=nchar(word)) %>%
arrange(token_length) %>%
filter(token_length > 3) %>%
filter(token_length < 17) %>%
arrange(token_length) %>%
count_tokens(name="stage 5: Remove Token Length Filter") %>%
document_term_freq_filter() %>%
count_tokens(name="Stage 6: Term Frequency Filter")
lematized <- tokens %>%
group_by(doc_id) %>%
summarise(documents_pos_tagged = paste(word,collapse = " "))
# remove mistakes
mistakes <- parallel::mclapply(lematized$documents_pos_tagged, hunspell_double_english)
mistakes <- unique(mistakes)
mistakes <- data.frame(word = unlist(mistakes))
master <- master %>%
mutate(doc_id = paste0("doc",row_number()))
tokens <- tokens %>%
anti_join(mistakes) %>%
count_tokens(name="Stage 7: After Mistake Removal") %>%
inner_join(master %>% select(-text))
#saveRDS(mistakes, "mistakes.rds") #backup
#update lematized text
lematized <- tokens %>%
group_by(doc_id) %>%
summarise(documents_pos_tagged = paste(word,collapse = " "))
#add lematized text to main df
master <- master %>%
left_join(lematized)
rm(mistakes, lematized, langmodel, StopWords_LM, stopwords_nltk, stopwords_fry)
saveRDS(tokens, "tokens.rds")
saveRDS(master, "master_pos.rds")
At this stage, tf-idf analysis is used as an exploratory tool to better understand which terms in 10-K reports are important across the different industries. Functions are defined which allow for dynamic tf-idf trimming and document term frequency filtering within groups. This enables exploration of keywords at different grouping levels with various levels of trimming. Three types of tokens are surveyed: unigrams, bigrams and trigrams.
The methodology used is as follows:
* Examine words ranked by frequency
* Examine words ranked by frequency with additional trimming
* Examine words ranked by tf-idf
* Examine words ranked by tf-idf with additional trimming
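For reference, the score produced by tidytext::bind_tf_idf for a word \(w\) in document \(d\), out of \(N\) documents, is
\[ \text{tf-idf}(w,d) = \frac{n_{w,d}}{\sum_{w'} n_{w',d}} \times \ln\frac{N}{|\{d : w \in d\}|} \]
where the "document" here is whichever grouping variable is passed as bind_by (industry or company), so the idf term down-weights words that appear in every group.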
#bind tf-idf by specified measure
bind_tf_idf_custom <- function(tokens, by, within_group_freq_bound) {
tokens <- tokens %>%
drop_na(.data[[by]]) %>%
drop_na(word) %>%
count(word, .data[[by]]) %>%
filter(n > within_group_freq_bound) %>%
bind_tf_idf(word, .data[[by]], n)
}
#filter tokens by tfidf for specified quantiles
trim_by_tfidf <- function(tokens, quantiles) {
if (!is.null(quantiles)) {
quantiles <- tokens %$%
quantile(tf_idf, probs = quantiles) %>%
tidy(quantiles, na.rm = F)
tokens <- tokens %>%
filter(tf_idf > quantiles$x[1], tf_idf < quantiles$x[2])
}
return(tokens)
}
#group and count based on measure specified
summarise_conditionally <- function(tokens, measure, group_by) {
if (measure != "n") {
tokens <- tokens %>%
group_by(.data[[group_by]]) %>% unique()
} else {
tokens <- tokens %>%
group_by(.data[[group_by]], word) %>%
summarise(n= sum(n))
}
return(tokens)
}
#bind by category, group by same or other category
#filter by tf-idf or within group frequency
#plot top n tokens for group
filter_bind_plot <- function(df,
tokens,
id,
bind_by,
group_by,
measure,
within_group_freq_bound = 0,
quantiles = NULL,
n = 10) {
meta <- df %>% select(.data[[id]],.data[[group_by]], .data[[ bind_by]]) %>% unique(.)
tokens %>%
bind_tf_idf_custom(by=bind_by, within_group_freq_bound) %>%
trim_by_tfidf(quantiles) %>%
inner_join(meta) %>%
select(.data[[measure]], .data[[group_by]], word) %>%
summarise_conditionally(measure = measure, group_by = group_by) %>%
arrange(desc(.data[[measure]])) %>%
mutate(row_number = row_number()) %>%
filter(row_number %in% 1:15) %>%
facet_bar(y = word, x = .data[[measure]], by = .data[[group_by]], name = name)
}
#adapted from (Yan, 2020)
#utility function which reorders words based on a given measure within groups
facet_bar <- function(df, y, x, by, nrow = 1, ncol = 3, scales = "free", name="") {
mapping <- aes(y = reorder_within({{ y }}, {{ x }}, {{ by }}),
x = {{ x }},
fill = {{ by }})
facet <- facet_wrap(vars({{ by }}),
nrow = nrow,
ncol = ncol,
scales = scales)
ggplot(df, mapping = mapping) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
facet +
ylab("") +
theme_light()
}
#dir to save images
if(!dir.exists("PartA")){
dir.create("PartA")
}
filter_bind_plot(master,
tokens,
id="doc_id",
group_by = "GICS_Sub_Industry",
bind_by = "GICS_Sub_Industry",
measure = "n", n =15)
ggsave("PartA/figure1.png", width = 40, height = 15, units = "cm")
Top 15 terms per GICS_Sub_Industry ranked by frequency
filter_bind_plot(master,
tokens,
id="doc_id",
group_by = "GICS_Sub_Industry",
bind_by = "cik",
measure = "n", n =15)
ggsave("PartA/figure2.png", width = 40, height = 15, units = "cm")
Defining the document level at the company or at the entire industry does not yield much variation in the most frequent terms. With little variation across industries, it appears that these tokens are used in any typical 10-K report. This makes sense, as companies are expected to discuss "revenue", "cost" and "income".
filter_bind_plot(master,
tokens,
id="doc_id",
group_by = "GICS_Sub_Industry",
bind_by = "GICS_Sub_Industry",
measure = "n",
quantiles = c(0, 0.9999), n=15)
ggsave("PartA/figure3.png", width = 40, height = 15, units = "cm")
Trimming the bag of words model using tf-idf removes most of the overlapping frequent terms.
filter_bind_plot(master,
tokens,
id="cik",
group_by = "GICS_Sub_Industry",
bind_by = "GICS_Sub_Industry",
measure = "tf_idf", n=15)
ggsave("PartA/figure4.png", width = 40, height = 15, units = "cm")
Ranking words using tf-idf without trimming results in a selection of industry-specific terms. For example, looking at the semiconductor industry, terms such as "wafer", "gigabit", "chipset" and "foundry" are dominant. These all refer to the manufacturing or components of video cards.
filter_bind_plot(master,
tokens,
id="doc_id",
group_by = "GICS_Sub_Industry",
bind_by = "GICS_Sub_Industry",
measure = "tf_idf",
within_group_freq_bound = 100)
ggsave("PartA/figure5.png", width = 40, height = 15, units = "cm")
Applying a heavy within-group term frequency filter removes rare terms with high tf-idf, allowing more dominant terms to stand out. For example, the name "MasterCard" is successfully removed from the Data Processing industry.
Overall this final analysis presents a good illustration of important keywords across the 3 industries.
In the case of Application Software, "subscription" is the most dominant term.
filter_bind_plot(master,
tokens,
id="doc_id",
group_by = "GICS_Sub_Industry",
bind_by = "GICS_Sub_Industry",
measure = "tf_idf",
quantiles = c(0.01, 0.98))
ggsave("PartA/figure6.png", width = 40, height = 15, units = "cm")
Trimming using tf-idf rather than within-group frequency yields a messier set of terms. This is because infrequent terms can still persist among the different categories.
bigrams <- master %>%
unnest_tokens(word, documents_pos_tagged, token="ngrams", n=2) %>%
drop_na(word)
ggplot(head(bigrams %>% group_by(word) %>% count() %>% arrange(desc(n)),15),
aes(reorder(word,n), n)) +
geom_bar(stat = "identity") + coord_flip() +
xlab("Bigrams") + ylab("Frequency") +
ggtitle("Most frequent bigrams")
ggsave("PartA/figure7.png", width = 40, height = 15, units = "cm")
filter_bind_plot(master,
bigrams,
id="doc_id",
group_by = "GICS_Sub_Industry",
bind_by = "GICS_Sub_Industry",
measure = "n", quantiles = c(0.25, 0.98))
ggsave("PartA/figure8.png", width = 40, height = 15, units = "cm")
filter_bind_plot(master,
bigrams,
id="doc_id",
group_by = "GICS_Sub_Industry",
bind_by = "GICS_Sub_Industry",
measure = "tf_idf", within_group_freq_bound = 50)
ggsave("PartA/figure9.png", width = 40, height = 15, units = "cm")
Analysis of bigrams is not very informative. Nonetheless, it does support some of the findings from the unigram analysis. For example, in the semiconductor industry reports many parts of video cards are mentioned, whilst in Application Software subscriptions are commonly discussed.
For trigrams, the raw text is used in an attempt to get more meaningful results.
trigrams <- master %>%
unnest_tokens(word, text, token="ngrams", n=3) %>%
drop_na(word)
ggplot(head(trigrams %>%
group_by(word) %>%
count() %>%
arrange(desc(n)),15), aes(reorder(word,n), n)) +
geom_bar(stat = "identity") + coord_flip() +
xlab("Tigrams") + ylab("Frequency") +
ggtitle("Most frequent tigrams")
ggsave("PartA/figure10.png", width = 40, height = 15, units = "cm")
filter_bind_plot(master,
trigrams,
id="doc_id",
group_by = "GICS_Sub_Industry",
bind_by = "GICS_Sub_Industry",
measure = "tf_idf")
The trigram "air miles reward" suggests that credit card rewards are an important topic in the Data Processing & Outsourced Services sub-industry, which is not surprising given that Visa and Mastercard are part of this sector. Software license updates, hardware system support, as well as subscriptions, are dominant tokens in the Application Software sector. In the semiconductor industry the focus is on shipping and manufacturing. This insight will be useful when labelling topics in the topic modelling phase.
An event study is performed at this stage. Event studies have a long history, going back to 1933. The main idea behind an event study is to compare the return of a stock during some window after an event to a baseline estimated from past data. This difference is known as the abnormal return, which can be attributed to the event. We largely follow the methodology of MacKinlay (1997).
Formally, the abnormal return is defined as:
\[ AR_{it} = R_{it} - E(R_{it}|X_t) \]
That is, the abnormal return is the actual return minus the normal return for time window \(t\), where \(X_t\) represents the conditioning information on which the normal return is modeled.
The normal return needs to be modeled before the analysis can begin. A simple approach would be to take the return on the stock a week before. However, this is not sound, since anticipation of the filing already affects market prices. In practice two models are often used: the constant mean return model and the market model. The first assumes that a given security has a constant mean return across time; the second assumes a linear relationship between the return on the security and the return on the market (MacKinlay, 1997). The market model is an improvement over the constant mean model because it helps reduce the variance associated with market movements. There are more complicated models, such as the Fama-French 3-factor and 5-factor models, which further reduce the variance associated with different firm types. However, in our case we assume that abnormal returns do occur, at least in some cases, and we seek to understand whether they can be attributed to the sentiment of the management section of the report. So instead of including fundamental indicator information in the market model during the abnormal return estimation phase, it is used as a control during the regression-on-sentiment phase.
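Under the market model, the normal return is obtained by regressing the security's return on the market return over the estimation window,
\[ R_{it} = \alpha_i + \beta_i R_{mt} + \epsilon_{it}, \qquad AR_{it} = R_{it} - \hat{\alpha}_i - \hat{\beta}_i R_{mt}, \qquad CAR_i = \sum_{t} AR_{it} \]
where, in this study, the market return \(R_{mt}\) is proxied by the average return of the other portfolio companies (see get_market_return below).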
After a model is chosen and the event window defined, abnormal and cumulative abnormal returns on a stock after filing are calculated by subtracting the benchmark (normal) returns from the actual returns. Following MacKinlay (1997), the estimation window for normal returns is defined as the 250 trading days before the event window, roughly equivalent to one calendar year. The event window is defined as two weeks before the filing to one week after the filing. The two-week gap serves to ensure independence of the normal returns from the event-driven returns. Abnormal returns are only calculated for the period after the filing.
Before prices are fetched to calculate returns, dates need to be formatted and adjusted. This is necessary because the stock market is closed on weekends and holidays: simply taking n days before or after the filing to calculate a financial indicator would produce NA values whenever that date falls on a non-trading day. To prevent this, days are offset based on the business day calendar. Hence holidays are declared using the bizdays package and dates are offset through its API. This approach is in line with how returns are normally calculated in the financial sector.
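A minimal sketch of the offsetting logic, using a throwaway calendar name and a single illustrative holiday (the full calendar used in the analysis is declared below):
library(bizdays)
# declare weekends and one holiday (New Year's Day 2021) as non-business days
create.calendar(name = "example_cal",
                weekdays = c("saturday", "sunday"),
                holidays = as.Date("2021-01-01"))
# one business day after Thursday 2020-12-31 skips the holiday and the weekend,
# landing on Monday 2021-01-04
bizdays::offset(as.Date("2020-12-31"), 1, "example_cal")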
The US stock exchange is also closed on a number of public holidays (2021 was used as the reference year when checking the calendar). Note that many holidays are shifted when they fall on weekends, whilst others depend on the week number, and some, like Inauguration Day, occur only every four years. This means that formulating a rules-based approach is tedious and error prone. Instead, a publicly available API is used to get the dates of US federal holidays. However, Good Friday is not a national holiday but a state one, so dates for this holiday are scraped from another website. A number of dates are added manually: the stock market closed during Hurricane Sandy and to commemorate George H. W. Bush's death.
domain <- "https://date.nager.at/api/v3/PublicHolidays"
years <- 2008:2021
holidays <- c()
for (year in years) {
url <- url(combURL(domain, c(year, "US")))
json <- jsonlite::stream_in(url)
holidays <- c(holidays, json$date)
on.exit(close(url))
}
#get page with religious public holidays
public_holidays <- read_html("http://www.maa.clell.de/StarDate/publ_holidays.html")
tbl1 <- public_holidays %>% html_nodes("table") %>%
.[[6]] %>%
html_table() %>%
as.data.frame() %>%
filter(X1 %in% years) %>%
mutate(X3 = str_replace_all(X3, "\\.", "-")) %>%
mutate(date = ymd(ydm(paste(X1, X3, sep="-")))) %>%
pull(date)
tbl2 <- public_holidays %>% html_nodes("table") %>%
.[[7]] %>%
html_table() %>%
as.data.frame() %>%
filter(X1 %in% years) %>%
mutate(X3 = str_replace_all(X3, "\\.", "-")) %>%
mutate(date = ymd(ydm(paste(X1, X3, sep="-")))) %>%
pull(date)
good_friday_dates <- c(tbl1, tbl2)
# add dates manually
dates <- c("2018-12-05", # george bushes death
"2012-10-29", # Hurricane Sandy
"2012-10-30") # Hurricane Sandy
holidays <- c(holidays, as.character(good_friday_dates), dates)
# declare holidays and weekends
create.calendar(name="mycal",
weekdays=c('saturday', 'sunday'),
holidays=holidays)
rm(good_friday_dates, dates, tbl2, tbl1, year, json, public_holidays, url, domain, years)
master <- master %>%
mutate(year = format(date, '%Y'))
master$year <- as.numeric(master$year)
domain <- "https://www.macrotrends.net/stocks/charts"
endpoints <- c("pe-ratio", "shares-outstanding", "eps-earnings-per-share-diluted",
"debt-equity-ratio", "roe", "roi", "roa")
colums_needed <- c("PE Ratio", "Debt to Equity Ratio", "Return on Equity",
"Return on Investment", "Return on Assets")
fundamental_indicators <- data.frame()
for (i in 1:nrow(portfolio)) {
company_fundamentals <- data.frame()
company_name <- portfolio [i,]$Security
company_symbol <- portfolio [i,]$Symbol
company_name <- str_replace_all(company_name, ' ', "-")
for (endopoint in endpoints) {
url <- combURL(domain, c(company_symbol, company_name, endopoint))
print(url)
html <- read_html(url)
tbl <- html %>% html_nodes("table") %>% .[[1]] %>% html_table() %>% as.data.frame()
if (!(endopoint %in% c("shares-outstanding", "eps-earnings-per-share-diluted"))) {
names(tbl) <- as.matrix(tbl[1, ])
tbl <- tbl[-1, ]
tbl[] <- lapply(tbl, function(tbl) type.convert(as.character(tbl)))
}
temp_df <- data.frame(date=as.character(tbl[,1]),
value=tbl[,names(tbl) %in% colums_needed])
if (ncol(temp_df) == 1) {
temp_df <- data.frame(date=parse_number(as.character(tbl[,1])),
value=parse_number(as.character(tbl[,2])))
} else {
temp_df <- data.frame(date=parse_number(as.character(temp_df[,1])),
value=parse_number(as.character(temp_df[,2])))
}
names(temp_df) <- c("date", endopoint)
if (!(endopoint %in% c("shares-outstanding", "eps-earnings-per-share-diluted"))) {
temp_df <- temp_df %>%
mutate_if(is.character, ~ year(ymd(date)))
}
temp_df <- temp_df %>% filter(date %in% 2008:2020)
temp_df <- aggregate(temp_df[,2], list(temp_df$date), mean)
names(temp_df) <- c("year", endopoint)
if (length(company_fundamentals) == 0) {
company_fundamentals <- temp_df
} else {
company_fundamentals <- temp_df %>% inner_join(company_fundamentals)
}
company_fundamentals$Symbol <- company_symbol
}
fundamental_indicators <- rbind(fundamental_indicators, company_fundamentals)
}
names(fundamental_indicators) <- str_replace_all(names(fundamental_indicators), "-", "_")
rm(company_fundamentals, company_name, company_symbol, endopoint,
endpoints, domain, temp_df, url, colums_needed, tbl, html)
colSums(!is.na(fundamental_indicators[,2:ncol(fundamental_indicators )]))
#roi column has lots of missing values unlike all other columns
#given the small size of our data set this indicator is dropped
fundamental_indicators <- fundamental_indicators %>%
select(-roi)
saveRDS(fundamental_indicators, "fundamental_indicators.rds")
master %>%
inner_join(fundamental_indicators, by = c("Symbol", "year")) %>%
ggplot(., aes(x=roa)) +
geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4)
The semiconductor industry has a higher ROA, indicating higher profitability per unit of assets deployed.
master %>%
inner_join(fundamental_indicators, by = c("Symbol", "year")) %>%
ggplot(., aes(x=roe)) +
geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4) +
xlim(-50, 100)
## Warning: Removed 22 rows containing non-finite values (stat_density).
master %>%
inner_join(fundamental_indicators, by = c("Symbol", "year")) %>%
ggplot(., aes(x=pe_ratio)) +
geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4) +
xlim(0, 100)
## Warning: Removed 18 rows containing non-finite values (stat_density).
Application Software has a more spread-out distribution of PE ratios. The semiconductor industry is the most conservative in terms of price per earnings.
master %>%
inner_join(fundamental_indicators, by = c("Symbol", "year")) %>%
ggplot(., aes(x=debt_equity_ratio)) +
geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4) +
xlim(0, 15)
## Warning: Removed 17 rows containing non-finite values (stat_density).
The semiconductor industry has the lowest level of leverage relative to equity. The Data Processing & Outsourced Services sector has the highest amount of debt.
master %>%
inner_join(fundamental_indicators, by = c("Symbol", "year")) %>%
ggplot(., aes(x=eps_earnings_per_share_diluted)) +
geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4)
Earnings per share appear to be largely the same across the industries examined.
Log returns are used in this analysis. Using log prices has a number of advantages, including ease of arithmetic manipulation. In most cases, with the exception of some technical indicators, the adjusted closing price is used since it takes into account corporate actions such as stock splits.
Note that some filings take place even when the stock market is closed, namely during Hurricane Sandy, so the original date also needs to be offset in this rare case. This makes the script more future proof.
EXAMPLE <- master[1,]
comp <- getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"), auto.assign=FALSE)
daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")
chartSeries(comp,
subset=daterange,
theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")
EXAMPLE <- master[1,]
getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"))
daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")
chartSeries(get(EXAMPLE$Symbol),
subset=daterange,
theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")
EXAMPLE <- master[2,]
getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"))
daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")
no_axis <- x <- chartSeries(ANSS,
subset=daterange,
theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")
EXAMPLE <- master[13,]
getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"))
daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")
chartSeries(ANSS,
subset=daterange,
theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")
Note that prices for 2009 are also extracted, as reports filed in 2010 have an estimation period that falls outside the bounds of the specified date range.
tickers <- master %>% pull(Symbol) %>% unique()
#list to store all prices
prices <- lapply(tickers, getSymbols, auto.assign=FALSE, from='2009-01-01',to='2020-12-31')
names(prices) <- tickers
Daily log returns are calculated. Log returns can be summed to obtain the cumulative return for a given period.
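In symbols, with \(P_t\) the (adjusted) closing price,
\[ r_t = \ln\frac{P_t}{P_{t-1}}, \qquad \sum_{t=1}^{T} r_t = \ln\frac{P_T}{P_0} \]
so summing the daily log returns over an event window gives the cumulative log return for that window.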
my_return <- function(x) {
y <- dailyReturn(x, type="log")
names(y) <- strsplit(names(x)[1], "\\.")[[1]][1]
y
}
returns <- lapply(prices, my_return)
names(returns) <- tickers
Volume is the total number of shares of a security traded in a given time period. However, companies have different numbers of outstanding shares, so the daily volume data is first normalized to enable comparison between companies. Furthermore, similar to abnormal returns, the log of volume per share is used, because volume is not normally distributed, breaking statistical and financial assumptions (Yadav, 1992). The log transform solves this issue, but a small constant has to be added to avoid taking the log of zero on days with no trading volume (Yadav, 1992). Additional information about the formula used can be found at: https://www.eventstudytools.com/volume-event-study
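The transformation implemented in dailyVolume below, with \(V_{it}\) the raw daily volume and \(S_{it}\) the shares outstanding in that year, is
\[ \tilde{v}_{it} = \ln\!\left(\frac{(V_{it} + 0.00025)\times 1000}{S_{it}}\right) \]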
dailyVolume <- function(v, y, df) {
shares <- df %>% filter(year == y) %>% pull(shares_outstanding)
vol <- log(((v + 0.00025)/shares*1000))
vol
}
my_volume <- function(i) {
x <- prices[[i]]
name <- names(prices)[i]
df <- fundamental_indicators %>% filter(Symbol == name)
v=Vo(x)
v$vol <- mapply(dailyVolume, v=v, year(index(x)), list(df))
return(v[,2])
}
volume <- lapply(seq_along(prices), my_volume)
names(volume) <- tickers
# utility function to help subset list of returns calculated earlier
subset_comp <- function(x, start_date, end_date) {
x <- x[as.character(paste(start_date, end_date, sep = "/"))]
as.data.frame(x)
}
# get avg market return for specified date range
# exclude target symbol
get_market_return <- function(symbol, estimation_start_date, estimation_end_date) {
avg_market_daily_returns <- mapply(FUN = subset_comp,
x = returns[tickers[tickers != symbol]],
start_date = estimation_start_date,
end_date = estimation_end_date)
avg_market_daily_returns <- as.data.frame(
matrix(unlist(avg_market_daily_returns),
nrow=length(unlist(avg_market_daily_returns[1]))))
avg_market_daily_returns <- rowMeans(avg_market_daily_returns, na.rm=T)
return(avg_market_daily_returns)
}
# get avg market volume for specified date range
# exclude target symbol
get_market_volume <- function(symbol, estimation_start_date, estimation_end_date) {
avg_market_daily_volume <- mapply(FUN = subset_comp,
x = volume[tickers[tickers != symbol]],
start_date = estimation_start_date,
end_date = estimation_end_date)
avg_market_daily_volume <- as.data.frame(
matrix(unlist(avg_market_daily_volume),
nrow=length(unlist(avg_market_daily_volume[1]))))
avg_market_daily_volume <- rowMeans(avg_market_daily_volume)
print(avg_market_daily_volume)
return(avg_market_daily_volume)
}
# fit market model
get_market_model <- function(avg_market_daily_returns_in_est, comp_returns_in_est) {
print(avg_market_daily_returns_in_est)
print(comp_returns_in_est)
x <- cbind(comp_returns_in_est, avg_market_daily_returns_in_est)
x <- as.data.frame(x)
names(x) <- c("company_returns", "market_return")
market_model <- lm(company_returns ~ market_return, data=x)
return(market_model)
}
# predict using market model
calculate_normal_returns <- function(market_model, comp_returns, avg_market_daily_returns) {
x <- cbind(comp_returns, avg_market_daily_returns)
x <- as.data.frame(x)
names(x) <- c("company_returns", "market_return")
normal_returns <- predict(market_model, x)
}
#
# Main function to be used in a mapping operation.
#
returns_calc <- function(symbol, date) {
#offset required dates
date <- if_else(date %in% as.Date(holidays), bizdays::offset(date, 1, "mycal"), date)
day_before <- bizdays::offset(date, -1, "mycal")
next_day <- bizdays::offset(date, 1, "mycal")
next_week <- bizdays::offset(day_before, 5, "mycal")
next_month <- bizdays::offset(day_before, 21, "mycal")
next_year <- bizdays::offset(day_before, 250, "mycal")
prior_2_weeks <- bizdays::offset(day_before, -14, "mycal")
estimation_start_date <- bizdays::offset(prior_2_weeks, -250, "mycal")
# VOLUME
# next day
# market volume during estimation period
avg_market_daily_volume <- get_market_volume(symbol, estimation_start_date, prior_2_weeks)
# target company volume during estimation period
interval <- as.character(paste(estimation_start_date, prior_2_weeks, sep = "/"))
comp_volume_in_est <- as.numeric(volume[[symbol]][interval, ])
# market model
market_model <- get_market_model(avg_market_daily_volume, comp_volume_in_est)
# calculating normal volume after filing
avg_market_daily_volume <- get_market_volume(symbol, date, next_day)
interval <- as.character(paste(date, next_day, sep = "/"))
comp_volume <- as.numeric(volume[[symbol]][interval, ])
normal_volume <- calculate_normal_returns(market_model, comp_volume, avg_market_daily_volume)
# calculating abnormal volume after filing
abnormal_volume <- comp_volume - normal_volume
avg_abnormal_volume_next_day <- mean(abnormal_volume)
cum_abnormal_volume_next_day <- sum(abnormal_volume)
volume_direction_next_day <- ifelse(cum_abnormal_volume_next_day > 0, 1, 0)
print("cum abnormal vol")
print(cum_abnormal_volume_next_day)
# RETURNS
# market returns during estimation period
avg_market_daily_returns_in_est <- get_market_return(symbol, estimation_start_date, prior_2_weeks)
# target company return during estimation period
interval <- as.character(paste(estimation_start_date, prior_2_weeks, sep = "/"))
comp_returns_in_est <- as.numeric(returns[[symbol]][interval, ])
# market model
market_model <- get_market_model(avg_market_daily_returns_in_est, comp_returns_in_est)
#AFTER FILING
# next day
# calculating normal returns after filing
avg_market_daily_returns <- get_market_return(symbol, date, next_day)
interval <- as.character(paste(date, next_day, sep = "/"))
comp_returns <- as.numeric(returns[[symbol]][interval, ])
normal_returns <- calculate_normal_returns(market_model, comp_returns, avg_market_daily_returns)
# calculating abnormal returns after filing
abnormal_returns <- comp_returns - normal_returns
avg_abnormal_return_next_day <- mean(abnormal_returns)
cum_abnormal_return_next_day <- sum(abnormal_returns)
direction_next_day <- ifelse(cum_abnormal_return_next_day > 0, 1, 0)
# week
# calculating normal returns after filing
avg_market_daily_returns <- get_market_return(symbol, date, next_week)
interval <- as.character(paste(date, next_week, sep = "/"))
comp_returns <- as.numeric(returns[[symbol]][interval, ])
normal_returns <- calculate_normal_returns(market_model, comp_returns, avg_market_daily_returns)
# calculating abnormal returns after filing
abnormal_returns <- comp_returns - normal_returns
avg_abnormal_return <- mean(abnormal_returns)
cum_abnormal_return <- sum(abnormal_returns)
variance_next_week <- var(abnormal_returns)
#BEFORE FILING
before <- prices[[symbol]][as.character(paste(prior_2_weeks, day_before, sep = "/")), ]
# TA indicators
# roc - momentum indicator
roc <- as.numeric(ROC(Ad(before),n = 7)[day_before])
# standard moving average
ma7 <- as.numeric(SMA(Ad(before), 7)[day_before])
# rsi - momentum indicator - strength of current movement direction
rsi <- as.numeric(RSI(Ad(before), 7)[day_before])
# is a measure of the money flowing into or out of a security.
obv <- as.numeric(OBV(Ad(before), Vo(before))[day_before])
df <- cbind(cum_abnormal_return, avg_abnormal_return,
avg_abnormal_return_next_day, cum_abnormal_return_next_day,
avg_abnormal_volume_next_day,cum_abnormal_volume_next_day, volume_direction_next_day,
roc, ma7, rsi, obv, variance_next_week)
names(df) <- c("cum_abnormal_return_next_week", "avg_abnormal_return_next_week",
"avg_abnormal_return_next_day", "cum_abnormal_return_next_day",
"avg_abnormal_volume_next_day","cum_abnormal_volume_next_day", "volume_direction_next_day",
"roc", "ma7", "rsi", "obv", "variance_next_week")
df
}
indicators <- master %>% select(doc_id, Symbol, date)
x <- mcmapply(returns_calc, symbol=indicators$Symbol, date=indicators$date)
indicators <- data.frame(cbind(as.data.frame(t(x)), indicators), row.names = 1:nrow(indicators))
rm(prices, returns, volume, tickers, end.time, start.time,
my_return, my_volume, returns_calc, subset_comp, i)
saveRDS(indicators, "indicators.rds")
Some of the indicators chosen are highly correlated. Therefore, one or two need to be removed to prevent multicollinearity problems.
corr <- round(cor(fundamental_indicators %>%
select(roa, roe, debt_equity_ratio,
eps_earnings_per_share_diluted, pe_ratio),
method="spearman"), 1)
ggcorrplot(corr)
ROE and ROA are correlated; one of them should be removed before regression.
hist(indicators$avg_abnormal_return_next_day, breaks = 40)
hist(indicators$cum_abnormal_return_next_day, breaks = 40)
Examining the distribution of average abnormal returns shows that in most cases there are no abnormal returns. Therefore, there is an opportunity to create a new feature: movement direction. A loss is defined as anything below the 25th percentile of cumulative abnormal returns, whilst a gain is defined as anything above the 75th percentile. Everything in between is defined as stay. This feature engineering should reduce the noise in the target variable.
indicators <- indicators %>%
mutate(movment_direction =
ifelse(cum_abnormal_return_next_day < quantile(indicators$cum_abnormal_return_next_day)[2],
"loss", "stay")) %>%
mutate(movment_direction =
ifelse(cum_abnormal_return_next_day > quantile(indicators$cum_abnormal_return_next_day)[4],
"gain", movment_direction)) %>%
mutate(movment_direction = factor(movment_direction))
Tidytext is used to get the afinn, nrc, and bing dictionaries. Alternatively, the syuzhet package could be used, but the tidytext dictionaries provide more flexibility in terms of token manipulation and sentiment calculation. The SentimentAnalysis package is used to get polarity with the Loughran-McDonald, Harvard General Inquirer and Henry dictionaries, as well as the LM uncertainty ratio. The Loughran-McDonald and Henry dictionaries are finance specific, so they are expected to produce more accurate scores. Furthermore, the SentimentAnalysis package is used to fit a custom sentiment dictionary by setting cumulative returns as the response variable. Finally, the sentimentr package is used to score sentiment whilst taking negation and inflection into consideration; both its default dictionary and a custom one based on the Loughran-McDonald dictionary from tidytext are used. The sentimentr function is a faster and more accurate alternative to the qdap polarity function created by the same author.
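As a small illustration of the valence-shifter handling (the sentences are made up and the exact scores depend on the dictionary used):
example <- sentimentr::get_sentences(c("Margins were strong this quarter.",
                                       "Margins were not strong this quarter."))
sentimentr::sentiment_by(example)
# the negated sentence should receive a lower (negative) average sentiment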
#nrc
#grab sum of emotional words to normalize emotions in next step
total <- tokens %>%
inner_join(get_sentiments("nrc")) %>%
filter(!(sentiment %in% c("positive", "negative"))) %>%
group_by(doc_id) %>%
count(doc_id, sentiment) %>%
summarize(total = sum(n)) %>%
pull(total)
#polarity + emotions
sent <- tokens %>%
inner_join(get_sentiments("nrc")) %>%
count(doc_id,sentiment) %>%
pivot_wider(names_from=sentiment, values_from=n) %>%
mutate(sentiment_nrc = (positive - negative)/(positive+negative)) %>%
select(-negative, -positive) %>%
mutate(across(c(2:9), .fns = ~./total))
#bing
sent <- tokens %>%
inner_join(get_sentiments("bing")) %>%
count(doc_id,sentiment) %>%
pivot_wider(names_from = sentiment,values_from = n) %>%
mutate(sentiment_bing = (positive-negative)/(positive+negative)) %>%
select(-negative, -positive) %>%
inner_join(sent)
#afin
sent <- tokens %>%
inner_join(get_sentiments("afinn")) %>%
group_by(doc_id) %>%
summarise(sentiment_afinn = sum(value)) %>%
inner_join(sent)
#sentimentr package
#default dictionary
text_with_punctuation <- readRDS("text_with_punctuation.rds")
text_with_punctuation <- sentimentr::get_sentences(text_with_punctuation)
sentimentr <- sentimentr::sentiment_by(text_with_punctuation)
sent$sentimentr <- sentimentr$ave_sentiment
#LM dictionary
lm <- tidytext::get_sentiments("loughran")
lm_key <- data.frame(
words = lm$word,
polarity = ifelse(lm$sentiment == "positive", 1, -1),
stringsAsFactors = T
)
lm_key <- sentimentr::as_key(lm_key)
sentimentr_lm <- sentimentr::sentiment_by(text_with_punctuation, polarity_dt = lm_key)
sent$sentimentr_lm <- sentimentr_lm$ave_sentiment
#LM, Gi, HE, Qdap sentiment dictionaries frin SentimentAnalysis package
sentiment <- SentimentAnalysis::analyzeSentiment(master$documents_pos_tagged,
stemming=FALSE, removeStopwords=FALSE)
sentiment$WordCount <- NULL
sent <- cbind(sent,sentiment)
#SentimentAnalysis package also provides the ability to create custom dictionary
#this is done by aligning words to response variable
cust_dict <- master %>%
inner_join(indicators) %>%
drop_na(cum_abnormal_return_next_day) %$%
generateDictionary(documents_pos_tagged,cum_abnormal_return_next_day,
modelType="ridge", family="binomial")
cust_dict$intercept <- NULL
cust_dict <- as.data.frame(matrix(unlist(cust_dict), nrow=length(unlist(cust_dict[1]))))
names(cust_dict) <- c("word", "sentiment", "idf")
cust_dict$sentiment <- as.numeric(cust_dict$sentiment)
sent <- tokens %>%
inner_join(cust_dict) %>%
group_by(doc_id) %>%
summarise(sentiment_custom = sum(sentiment)) %>%
inner_join(sent, by="doc_id")
#calculate and add sentiment change columns to sent df
ids <- master %>% select(cik, date, doc_id)
sent_diff <- function(sentiment) {
sentiment_change <- sentiment - lag(sentiment)
}
sent <- sent %>%
inner_join(ids) %>%
mutate(cik = factor(cik)) %>%
group_by(cik) %>%
arrange(date) %>%
mutate(across(where(is.numeric), list(change = ~ sent_diff(.))))
saveRDS(sent, "sent.rds")
rm(cust_dict, sentiment, lm, ids, lm_key, sent_diff,
text_with_punctuation, sentimentr_lm, tokens)
meta <- master %>% select(doc_id, GICS_Sub_Industry)
data_for_regression <- sent %>%
inner_join(indicators, by = c("doc_id", "date")) %>%
inner_join(meta, by = "doc_id") %>%
mutate(year = year(date)) %>%
left_join(fundamental_indicators, by=c("Symbol", "year")) %>%
mutate(cik = factor(cik)) %>%
mutate(GICS_Sub_Industry = factor(GICS_Sub_Industry))
In this dataset multiple entities are observed across time, i.e. the data is panel data. Simple cross-sectional analysis ignores unobserved heterogeneity among companies: firms may have underlying fixed factors, anything from brand value to employee satisfaction, that are not captured by the model. To build a robust model this must be controlled for, either through the plm package or manually. Specifying the fixed effects manually with lm enables further analysis with cross-validation, which is not as straightforward with plm. Cross-validation allows us to check whether the model is overfitting and to assess its general predictive power.
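To make the fixed-effects choice concrete, below is a minimal illustrative sketch, not part of the analysis itself. It assumes the data_for_regression frame built above and uses sentiment_nrc as an example regressor; plm is not loaded elsewhere in this document and would need to be attached separately. Both specifications absorb company-level fixed effects, mirroring how cik is later added to the control set, and they recover the same slope on the sentiment variable.
#illustrative sketch only: company fixed effects via cik dummies in lm() vs plm's within estimator
library(plm)
fe_lm <- lm(cum_abnormal_return_next_day ~ sentiment_nrc + factor(cik),
data = data_for_regression, x = TRUE, y = TRUE) #x, y kept so lmvar::cv.lm can be applied
fe_plm <- plm(cum_abnormal_return_next_day ~ sentiment_nrc,
data = data_for_regression, index = "cik", model = "within")
coef(fe_lm)["sentiment_nrc"] #same estimate from both approaches
coef(fe_plm)["sentiment_nrc"]
The dummy-variable form is what the regressions below use, since it keeps the fitted object compatible with cross-validation.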
# function to store cross validation results in tidy format
store_cv <- function(cv, y, x, to) {
cv$y <- y
cv$x <- x
temp <- data.frame(model = unlist(cv))
temp$measure <- rownames(temp)
temp <- pivot_wider(temp,names_from = measure, values_from = model)
return(rbind(to, temp))
}
The function regression2DwithCV takes a list of dependent variables, independent variables, and control variables. A regression is fitted for each pair of dependent and independent variables whilst controlling for the specified variables. 10-fold cross-validation is performed and the results are stored using the function defined above. Stargazer (or sjPlot's tab_model) can then be used to display the regression results, while the stored cross-validation results are compared below.
cv_resutls <- data.frame()
regression2DwithCV <- function(dv_names, iv_names, controls, name) {
sentiment_models <- list()
for (y in dv_names){
for (x in iv_names) {
x <- paste(x, controls, sep = "+")
form <- formula(paste(y, "~", x))
model <- lm(form, data=data_for_regression, x = TRUE, y = TRUE)
sentiment_models[[y]][[x]] <- model
cv_model <- cv.lm(model, k = 10, seed = 123, max_cores = detectCores() - 1)
cv_resutls <<- store_cv(cv_model, y, x, cv_resutls)
}
}
for (y in sentiment_models) {
#link <- combURL("PartB", paste0(all.vars(formula(y[[1]]))[1], "_", name, ".html"))
#print(name)
#stargazer::stargazer(y, type = "html", omit="cik", out = link)
#print(tab_model(y, collapse.ci = TRUE, collapse.se = TRUE))
}
}
control <- c("GICS_Sub_Industry")
control <- paste(control, collapse = '+')
dv_names <- c("cum_abnormal_return_next_week", "avg_abnormal_return_next_week",
"avg_abnormal_return_next_day", "cum_abnormal_return_next_day",
"avg_abnormal_volume_next_day","cum_abnormal_volume_next_day",
"volume_direction_next_day", "variance_next_week")
iv_names <- c("roe", "roa", "debt_equity_ratio",
"eps_earnings_per_share_diluted",
"pe_ratio", "roc", "ma7", "rsi", "obv")
control <- c("GICS_Sub_Industry")
control <- paste(control, collapse = '+')
regression2DwithCV(dv_names,iv_names, control, name="base_model-1")
iv_names <- c("roe + ma7", "roa + ma7", "debt_equity_ratio + ma7",
"eps_earnings_per_share_diluted + ma7",
"pe_ratio + ma7", "roc + ma7", "rsi + ma7", "obv + ma7")
regression2DwithCV(dv_names,iv_names, control, name="base_model-2")
control <- c("cik", "GICS_Sub_Industry", "debt_equity_ratio", "ma7")
control <- paste(control, collapse = '+')
iv_names <- c("sentiment_nrc",
"sentiment_bing",
"sentiment_afinn",
"SentimentGI",
"SentimentHE",
"SentimentLM",
"SentimentQDAP",
"RatioUncertaintyLM",
"sentimentr_lm",
"sentimentr",
"sentiment_custom")
regression2DwithCV(dv_names,iv_names, control, name="just_sentiment")
iv_names <- c("sentiment_nrc_change",
"sentiment_bing_change",
"sentiment_afinn_change",
"SentimentGI_change",
"SentimentHE_change",
"SentimentLM_change",
"SentimentQDAP_change",
"RatioUncertaintyLM_change",
"sentimentr_lm_change",
"sentimentr_change",
"sentiment_custom_change")
regression2DwithCV(dv_names,iv_names, control, name="sentiment_change")
iv_names <- c("NegativityGI",
"PositivityGI",
"NegativityHE",
"PositivityHE",
"PositivityLM",
"NegativityLM",
"NegativityQDAP",
"PositivityQDAP")
regression2DwithCV(dv_names,iv_names, control, name="neg_pos")
iv_names <- c("NegativityGI_change",
"PositivityGI_change",
"NegativityHE_change",
"PositivityHE_change",
"PositivityLM_change",
"NegativityLM_change",
"NegativityQDAP_change",
"PositivityQDAP_change")
regression2DwithCV(dv_names,iv_names, control, name="neg_pos_change")
iv_names <- c("anger",
"anticipation",
"disgust",
"fear",
"joy",
"sadness",
"surprise")
regression2DwithCV(dv_names,iv_names, control, name="emotion")
regression2DwithCV(dv_names, paste(iv_names, collapse = '+' ), control)
iv_names <- c("anger_change",
"anticipation_change",
"disgust_change",
"fear_change",
"joy_change",
"sadness_change",
"surprise_change")
regression2DwithCV(dv_names,iv_names, control, name="emotion_change")
regression2DwithCV(dv_names, paste(iv_names, collapse = '+' ), control)
iv_names <- c("SentimentLM",
"NegativityLM",
"disgust",
"disgust_change",
"sadness_change",
"sentiment_nrc")
regression2DwithCV(dv_names, paste(iv_names, collapse = '+' ), control, name="combined")
saveRDS(cv_resutls, "PartB/cv_results.rds")
cv_results <- readRDS("PartB/cv_results.rds")
cvs <- cv_results %>%
group_by(y) %>%
arrange(MAE.mean) %>%
top_n(5) %>%
select(MAE.mean, MAE.sd, y, x) %>%
group_split()
## Selecting by x
#%>%
#knitr::kable() %>%
#kable_styling(position = "center")
cvs <- lapply(cvs, as.data.frame)
#stargazer::stargazer(cvs, summary = rep(F,length(cvs)), type = "text",no.space=TRUE)
for (table in cvs) {
print(knitr::kable(table))
}
MAE.mean | MAE.sd | y | x |
---|---|---|---|
0.0091782386533657 | 0.00150966428059187 | avg_abnormal_return_next_day | surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.00918275577718206 | 0.0014864845553651 | avg_abnormal_return_next_day | sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.00937887457994281 | 0.00157252778146152 | avg_abnormal_return_next_day | sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.00942693109529079 | 0.00163577620450724 | avg_abnormal_return_next_day | surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.00944016648277091 | 0.00157738589935523 | avg_abnormal_return_next_day | sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
MAE.mean | MAE.sd | y | x |
---|---|---|---|
0.00488252030219267 | 0.000792097772451372 | avg_abnormal_return_next_week | sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.00488480213504792 | 0.000827350956256574 | avg_abnormal_return_next_week | sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.0048971316189137 | 0.000803705505237141 | avg_abnormal_return_next_week | surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.00506523396665765 | 0.00101059136594638 | avg_abnormal_return_next_week | sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.00511027038066066 | 0.000973603470904424 | avg_abnormal_return_next_week | surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
MAE.mean | MAE.sd | y | x |
---|---|---|---|
0.285823179466672 | 0.0332135958018523 | avg_abnormal_volume_next_day | sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.286055675928028 | 0.0305144248874301 | avg_abnormal_volume_next_day | sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.286720532002201 | 0.0303513796748044 | avg_abnormal_volume_next_day | surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.28773818023958 | 0.02961373174355 | avg_abnormal_volume_next_day | sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.289037804342597 | 0.034031508505347 | avg_abnormal_volume_next_day | surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
MAE.mean | MAE.sd | y | x |
---|---|---|---|
0.0183564773067314 | 0.00301932856118374 | cum_abnormal_return_next_day | surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.0183655115543641 | 0.00297296911073021 | cum_abnormal_return_next_day | sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.0187577491598856 | 0.00314505556292304 | cum_abnormal_return_next_day | sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.0188538621905816 | 0.00327155240901447 | cum_abnormal_return_next_day | surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.0188803329655418 | 0.00315477179871046 | cum_abnormal_return_next_day | sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
MAE.mean | MAE.sd | y | x |
---|---|---|---|
0.0246264358413286 | 0.00396814378728267 | cum_abnormal_return_next_week | sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.0246437007683787 | 0.00415640509804959 | cum_abnormal_return_next_week | sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.024726079602924 | 0.0040109198300447 | cum_abnormal_return_next_week | surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.0255328776878948 | 0.0051408946575127 | cum_abnormal_return_next_week | sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.0257899431787833 | 0.00492203229343267 | cum_abnormal_return_next_week | surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
MAE.mean | MAE.sd | y | x |
---|---|---|---|
0.571646358933344 | 0.0664271916037046 | cum_abnormal_volume_next_day | sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.572111351856057 | 0.0610288497748603 | cum_abnormal_volume_next_day | sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.573441064004402 | 0.0607027593496088 | cum_abnormal_volume_next_day | surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.575476360479161 | 0.0592274634870999 | cum_abnormal_volume_next_day | sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.578075608685193 | 0.0680630170106941 | cum_abnormal_volume_next_day | surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
MAE.mean | MAE.sd | y | x |
---|---|---|---|
0.000230415809367263 | 6.85018531782221e-05 | variance_next_week | sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.000230964051486379 | 6.90107158300945e-05 | variance_next_week | sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.000231354207990225 | 6.22479921024627e-05 | variance_next_week | surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.000232941853469224 | 4.48720128309396e-05 | variance_next_week | surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.000236066889363565 | 3.90054950919969e-05 | variance_next_week | sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
MAE.mean | MAE.sd | y | x |
---|---|---|---|
0.409198809719714 | 0.0395613828604193 | volume_direction_next_day | sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.409492450027264 | 0.0423304681838998 | volume_direction_next_day | surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.420026866804643 | 0.0460438835979588 | volume_direction_next_day | surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.420072765324612 | 0.0464012455806698 | volume_direction_next_day | sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
0.420807588254073 | 0.0455501247626208 | volume_direction_next_day | sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7 |
Surprise_change is best at predicting cum_abnormal_return_next_day and avg_abnormal_return_next_day. Sentimentr_lm and sentimentr_lm_change lead in terms of predictive power for all other financial indicators. Note, however, that none of these variables are statistically significant according to the regression results. For detailed regression results see the Appendix.
control <- c("cik", "GICS_Sub_Industry", "debt_equity_ratio", "ma7")
control <- paste(control, collapse = '+')
sentiment_models <- list()
dv_names_binary <- c("movment_direction")
iv_names <- c("sentiment_nrc",
"sentiment_bing",
"sentiment_afinn",
"SentimentGI",
"SentimentHE",
"SentimentLM",
"SentimentQDAP",
"RatioUncertaintyLM",
"sentimentr",
"NegativityGI",
"PositivityGI",
"NegativityHE",
"PositivityHE",
"PositivityLM",
"NegativityLM",
"NegativityQDAP",
"PositivityQDAP",
"sentiment_nrc_change",
"sentiment_bing_change",
"sentiment_afinn_change",
"SentimentGI_change",
"SentimentHE_change",
"SentimentLM_change",
"SentimentQDAP_change",
"RatioUncertaintyLM_change",
"sentimentr_change",
"NegativityGI_change",
"PositivityGI_change",
"NegativityHE_change",
"PositivityHE_change",
"PositivityLM_change",
"NegativityLM_change",
"NegativityQDAP_change",
"PositivityQDAP_change")
#note: with a factor response, glm(family = "binomial") treats the first level as failure and all other levels as success
for (y in dv_names_binary){
for (x in iv_names) {
x <- paste(x, control, sep = "+")
form <- formula(paste(y, "~", x))
sentiment_models[[y]][[x]] <- glm(form, data=data_for_regression, family = "binomial")
}
}
name <- "test"
for (y in sentiment_models) {
name <- combURL("PartB", paste0(all.vars(formula(y[[1]]))[1], "-", name, ".tex"))
stargazer::stargazer(y, type = "latex", out = name)
}
A topic is a bag of words where each word is assigned a probability of belonging to the topic. A document consists of multiple topics of varying proportions.
In this part of the study Structural Topic Modelling (STM) is used to discover topics in the textual data. The main advantage of STM over LDA or CTM is that it allows document metadata to be included in the model (Roberts, 2016). Topic prevalence and topic content in a document can thus be associated with metadata.
The topics discussed by management depend on the industry the company is in, as was clearly illustrated by the tf-idf token exploration in Part A. The time at which a report was written likely also affects the topics mentioned; for instance, in 2020 reports topics related to hygiene are expected to be more prevalent. Therefore sub-industry and temporal variables are included as STM covariates. Primarily, we want to examine the relationship between a stock's return and the topics discussed, so cumulative abnormal returns are added as a prevalence covariate. Additionally, company features such as the price-to-earnings ratio and debt-to-equity ratio are included since they were found to be good predictors of abnormal returns. For instance, it is possible that the management of a company with a higher debt-to-equity ratio will focus on debt in the management discussion section of the report. Technical indicators are not included since they only reflect the short-term nature of a company's price movements rather than some intrinsic company state, so there is no plausible mechanism by which they could affect the topics discussed.
According to Stewart (2020), prevalence covariates are not particularly sensitive to the number of metadata variables used, whereas content covariates are. Therefore, whilst adding a number of prevalence covariates appears reasonable, adding numerous content covariates does not. Moreover, adding content covariates removes the ability to carry out searchK. Since selecting K is perhaps the most important decision in this type of analysis, no content covariates are added.
Spectral initialization is recommended by Roberts et al. (2016): it outperforms LDA and random initialization, and it returns consistent topics by focusing on anchor words (Mourtgos & Adams, 2019). According to the stm documentation, a rough guess of the optimal number of topics is up to 50 for a corpus of a few hundred documents. Therefore, searchK in this study is carried out for K between 2 and 60.
STM output is given in the form of words ranked within each topic by highest probability, FREX, lift and score. To label topics the main focus is on the FREX measure, which weighs the frequency of a word's appearance against its exclusivity to the topic. A conceptual parallel to tf-idf can be drawn here.
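For intuition, the sketch below shows how a FREX-style score can be computed as a weighted harmonic mean of the empirical-CDF ranks of a word's within-topic frequency and its exclusivity to the topic. The frex_sketch function, the weight w and the toy beta matrix are illustrative assumptions; stm's own labelTopics()/calcfrex() uses a smoothed ECDF, so the exact values will differ.
#minimal illustrative sketch of the FREX weighting, not stm's exact calcfrex()
#beta: K x V matrix of per-topic word probabilities (rows sum to one)
frex_sketch <- function(beta, w = 0.5) {
excl <- sweep(beta, 2, colSums(beta), "/") #exclusivity: each word's probability mass owned by each topic
t(sapply(seq_len(nrow(beta)), function(k) {
freq_rank <- ecdf(beta[k, ])(beta[k, ]) #rank of within-topic frequency
excl_rank <- ecdf(excl[k, ])(excl[k, ]) #rank of exclusivity to topic k
1 / (w / excl_rank + (1 - w) / freq_rank) #weighted harmonic mean of the two ranks
}))
}
#toy example: word 3 is exclusive to topic 1, word 4 to topic 2
beta_toy <- rbind(c(0.4, 0.3, 0.2, 0.1), c(0.4, 0.3, 0.05, 0.25))
round(frex_sketch(beta_toy), 2)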
meta <- master %>%
select(cik, GICS_Sub_Industry, documents_pos_tagged, date) %>%
mutate(year = year(date), cik = factor(cik), GICS_Sub_Industry = factor(GICS_Sub_Industry))
data_for_stm <- data_for_regression %>%
select(cum_abnormal_return_next_week, cum_abnormal_return_next_day, debt_equity_ratio, pe_ratio, cik, year) %>%
inner_join(meta, by=c("cik", "year")) %>% select(-cik, -date)
rm(data_for_regression)
processed <- textProcessor(data_for_stm$documents_pos_tagged,
metadata = data_for_stm,
stem = F)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Creating Output...
threshold <- round(1/100 * length(processed$documents),0)
out <- prepDocuments(processed$documents,
processed$vocab,
processed$meta,
lower.thresh = threshold)
## Removing 64 of 3169 terms (166 of 203687 tokens) due to frequency
## Your corpus now has 326 documents, 3105 terms and 203521 tokens.
k_values = seq(from=2,to=60,by=2)
search_k_results <- searchK(out$documents, out$vocab, K=k_values,
N = floor(0.1*length(out$documents)),
prevalence = ~cum_abnormal_return_next_day +
factor(cik) +
s(year) +
GICS_Sub_Industry +
pe_ratio +
debt_equity_ratio,
cores = 2,
data=out$meta)
k_values = seq(from=2,to=12,by=1)
search_k_results_deep <- searchK(out$documents, out$vocab, K=k_values,
N = floor(0.1*length(out$documents)),
prevalence = ~cum_abnormal_return_next_day +
s(year) +
factor(cik) +
GICS_Sub_Industry +
pe_ratio +
debt_equity_ratio,
cores = 2,
data=out$meta)
k_values = seq(from=8,to=20,by=1)
search_k_results_deep <- searchK(out$documents, out$vocab, K=k_values,
N = floor(0.1*length(out$documents)),
prevalence = ~cum_abnormal_return_next_day +
s(year) +
factor(cik) +
GICS_Sub_Industry +
pe_ratio +
debt_equity_ratio,
cores = 2,
data=out$meta)
Search K
Unsurprisingly, semantic coherence is very high when few topics are present; this follows from the statistical properties of the measure (Mimno, 2011; Roberts, 2014). Thus, peaks beyond the initial high values should be examined. The levelling off of the held-out likelihood and lower-bound curves, as well as the trough in the residuals plot and the peak in semantic coherence, all suggest an optimal value of K equal to 9.
optimal_k <- 9
optimal_k_models <- selectModel(documents = out$documents,
vocab = out$vocab,
K = optimal_k,
prevalence = ~cum_abnormal_return_next_day +
s(year) +
GICS_Sub_Industry +
pe_ratio +
debt_equity_ratio,
max.em.its = 150,
gamma.prior='L1',
data = out$meta,
init.type = "Spectral",
ngroups = 5)
Comparing Models
For most topics the candidate models have almost the same semantic coherence and exclusivity values. However, for one of the topics models 2 and 3 outperform all others in terms of semantic coherence by a substantial margin. Model 2 is selected for further analysis as it performs better on exclusivity in some of the other cases.
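The comparison described above can be produced with stm's plotModels(); the call below is a brief sketch, with axis labels assumed to match the exclusivity versus semantic coherence trade-off being discussed.
#exclusivity vs semantic coherence for each candidate run returned by selectModel()
plotModels(optimal_k_models, xlab = "Semantic Coherence", ylab = "Exclusivity",
labels = 1:length(optimal_k_models$runout))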
model <- optimal_k_models$runout[[2]]
save(model, file="PartC/stm_optimal_model.rda")
tidy_summary <- data.frame(FREX = do.call(paste,
c(as.data.frame(summary(model)$frex), sep=", ")),
Lift = do.call(paste,
c(as.data.frame(summary(model)$lift), sep=", ")),
Score = do.call(paste,
c(as.data.frame(summary(model)$score), sep=", ")),
Prob = do.call(paste,
c(as.data.frame(summary(model)$prob), sep=", "))) %>%
mutate(topic = row_number()) %>%
select(topic, everything())
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
labels <- c("Lawsuits", "Operationw Abroad" , "Software & Hardware",
"Accounting Standards", "Distribution and Manufacturing",
"Insurance", "Digital Royalty", "Card Rewards", "Subscription")
tidy_summary[,2:5] %>%
knitr::kable(booktabs = TRUE) %>%
pack_rows(index = paste("Topic", c(1:9), ":", labels)) %>%
kable_styling(latex_options = "scale_down")
FREX | Lift | Score | Prob |
---|---|---|---|
Topic 1 : Lawsuits | |||
class, litigation, court, escrow, complaint, interchange, plaintiff | accountants, covered, disagreements, circumvention, overriding, benefits, signatory | accountants, class, plaintiff, complaint, escrow, covered, defendant | company, asset, share, income, class, common, settlement |
Topic 2 : Operations Abroad | |||
consumer, transfer, agent, region, negatively, versus, euro | reside, arbitrator, constructive, expectancy, generator, illicit, impediment | lines, consumer, border, check, segment, processing, agent | rate, revenue, income, currency, foreign, expense, business |
Topic 3 : Software & Hardware | |||
hardware, percent, support, software, update, system, premise | decreasing, deflationary, elements, optimistic, picture, post-combination, quantified | hardware, deflationary, support, subscription, software, middleware, license | revenue, service, software, expense, product, customer, support |
Topic 4 : Accounting Standards | |||
non-gaap, investor, mutual, company, proxy, communication, measure | academia, accountable, forecasts, imprecise, reaction, sub-section, biomedical | company, chemistry, non-gaap, proxy, earnings, perpetual, investor | company, revenue, income, expense, related, result, rate |
Topic 5 : Distribution and Manufacturing | |||
client, payroll, insurance, fund, processing, online, solution | intermediate, quicken, calculate, centralize, embezzlement, exhaustive, garnishment | client, centralize, payroll, worker, segment, processing, service | revenue, service, income, client, rate, total, business |
Topic 6 : Insurance | |||
wafer, distributor, inventory, fabrication, memory, manufacturing, shipment | advantages, averse, bifurcation, constantly, controllers, dense, enthusiast | wafer, distributor, fabrication, inventory, memory, gigabit, semiconductor | product, income, expense, rate, primarily, market, result |
Topic 7 : Digital Royalty | |||
digital, royalty, device, creative, media, wireless, circuit | authoring, acrobat, advertiser, cost-sensitive, foregone, hobbyist, localization | risky, subscription, wireless, creative, circuit, modem, acrobat | revenue, related, primarily, income, product, expense, increase |
Topic 8 : Card Rewards | |||
fuel, mile, label, reward, private, spread, redemption | accessory, accordion, apparel, bankrupt, branded, coalition, collector | fuel, mile, label, conduit, reward, fleet, grocery | credit, revenue, rate, increase, income, expense, transaction |
Topic 9 : Subscription | |||
maintenance, subscription, observable, billing, professional, input, privately | ample, convention, correct, diligent, drawdown, forint, freight | subscription, maintenance, forint, hardware, perpetual, upfront, seat | revenue, product, cost, increase, service, expense, asset |
tidy_gamma <- tidy(model, matrix = "gamma", document_names = rownames(out$meta))
tidy_gamma %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
mutate(topic = reorder(topic, gamma)) %>%
ggplot(aes(topic, gamma, label = labels, fill = topic)) +
geom_col(show.legend = FALSE, alpha = 0.8) +
geom_text(hjust = 1.2, nudge_y = 0.0005, size = 10, color='white') +
coord_flip() +
theme_light(base_size = 22) +
labs(x = NULL, y = expression(gamma),
title = "Top topics by prevalence in 10-K reports")
ggsave("PartC/topic_proportions_in_corpus.png",
width = 50, height = 35, units = "cm")
Topic Proportions across Industries
tidy_gamma %>%
pivot_wider(id_cols=document, names_from = topic, values_from = gamma) %>%
cbind(meta) %>%
select(-documents_pos_tagged, -year, -cik) %>%
group_by(GICS_Sub_Industry) %>%
summarise(across(where(is.numeric), mean)) %>%
pivot_longer(!GICS_Sub_Industry, names_to = "topic", values_to = "gamma") %>%
mutate(topic = factor(topic)) %>%
ggplot() +
geom_bar(aes(x = topic, y = gamma, fill =topic), alpha = 0.8, stat = "identity") +
facet_wrap(.~GICS_Sub_Industry) +
theme_light(base_size = 22) +
theme(strip.background=element_blank(),
strip.text=element_text(colour = 'black', face = "bold", size = 17)) +
xlab("Topic") + ylab("Mean Gamma") +
scale_fill_discrete(name="Legend",labels=labels)
ggsave("PartC/topic_proportions_across_industries.png",
width = 50, height = 35, units = "cm")
This graph shows how topic proportions vary across industries.
Topic Proportions across Industries
#exclude all docs with prob less than 1 percent
#allows for better examination
ggplot(tidy_gamma %>% filter(gamma > 0.01),
aes(gamma, fill = as.factor(topic))) +
geom_histogram(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, ncol = 3) +
labs(title = "Document probabilities distribution per topic",
y = "Number of reports", x = expression(gamma)) +
theme_light(base_size = 22)
ggsave("PartC/document_probabilities_distribution.png",
width = 45, height = 45, units = "cm")
Document probabilities distribution per topic
Each topic is strongly associated with some of the documents and less so with others.
effects_return <- estimateEffect(1:optimal_k ~debt_equity_ratio, stmobj = model, meta = out$meta)
margin1 <- as.numeric(quantile(out$meta$debt_equity_ratio)[2])
margin2 <- as.numeric(quantile(out$meta$debt_equity_ratio)[4])
plot(effects_return, covariate = "debt_equity_ratio",
topics = 1:optimal_k,
model = model, method = "difference",
cov.value1 = margin2, cov.value2 = margin1,
xlab = "Low Debt ... High Debt",
xlim = c(-0.01,0.01),
main = "Marginal change on topic probabilities for low and high price",
custom.labels = labels,
ci.level = 0.05,
labeltype = "custom")
effects_return <- estimateEffect(1:optimal_k ~pe_ratio, stmobj = model, meta = out$meta)
margin1 <- as.numeric(quantile(out$meta$pe_ratio)[2])
margin2 <- as.numeric(quantile(out$meta$pe_ratio)[4])
plot(effects_return, covariate = "pe_ratio",
topics = 1:optimal_k,
model = model, method = "difference",
cov.value1 = margin2, cov.value2 = margin1,
xlab = "Low PE ... High PE",
xlim = c(-0.01,0.01),
main = "Marginal change on topic probabilities for low and high PE ratio",
custom.labels = labels,
ci.level = 0.05,
labeltype = "custom")
effects_return <- estimateEffect(1:optimal_k ~cum_abnormal_return_next_day, stmobj = model, meta = out$meta)
margin1 <- as.numeric(quantile(out$meta$cum_abnormal_return_next_day)[2])
margin2 <- as.numeric(quantile(out$meta$cum_abnormal_return_next_day)[4])
plot(effects_return, covariate = "cum_abnormal_return_next_day",
topics = 1:optimal_k,
model = model, method = "difference",
cov.value1 = margin2, cov.value2 = margin1,
xlab = "Low Price ... High Price",
xlim = c(-0.05,0.05),
main = "Marginal change on topic probabilities for low and high price",
custom.labels = labels,
ci.level = 0.05,
labeltype = "custom")
effects_year <- estimateEffect(1:optimal_k ~s(year), stmobj = model, meta = out$meta)
plot(effects_year, covariate = "year",
topics = 1:optimal_k,
model = model, method = "continuous",
xlab = "Past ... Present",
main = "Marginal change on topic probabilities across years",
custom.labels =labels,
ci.level = 0.05,
labeltype = "custom")
topic_correlations <- topicCorr(model)
plot.topicCorr(topic_correlations,
vlabels = labels,
vertex.color = "#CDF0EA",
vertex.label.cex = 1,
vertex.size=30,
vertex.label.color="#053742")
These charts depict the marginal change in expected topic prevalence as each covariate changes, together with the correlations between topics.
tidy_theta <- as.data.frame(model$theta)
colnames(tidy_theta) <- paste0("topic_",1:9)
tidy_theta <- cbind(out$meta,tidy_theta)
topics <- paste0("topic_", 1:9)
iv_names <- paste(c(topics , "cik"), collapse = " + ")
lm_model <- lm(paste("cum_abnormal_return_next_day ~", iv_names), tidy_theta)
tidy(lm_model) %>% kable() %>%
kable_styling(position = "center")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0045039 | 0.0093844 | -0.4799320 | 0.6316400 |
topic_1 | -0.0146199 | 0.0095478 | -1.5312351 | 0.1268090 |
topic_2 | 0.0064744 | 0.0068529 | 0.9447689 | 0.3455688 |
topic_3 | 0.0070618 | 0.0082127 | 0.8598637 | 0.3905794 |
topic_4 | 0.0112073 | 0.0087023 | 1.2878567 | 0.1988297 |
topic_5 | -0.0081481 | 0.0079406 | -1.0261292 | 0.3056918 |
topic_6 | -0.0000164 | 0.0058805 | -0.0027942 | 0.9977724 |
topic_7 | 0.0024948 | 0.0076385 | 0.3266045 | 0.7442043 |
topic_8 | 0.0037274 | 0.0079790 | 0.4671466 | 0.6407482 |
topic_9 | NA | NA | NA | NA |
cik4127 | 0.0025809 | 0.0116364 | 0.2217923 | 0.8246328 |
cik6281 | 0.0066921 | 0.0116626 | 0.5738114 | 0.5665433 |
cik723125 | 0.0079608 | 0.0115370 | 0.6900222 | 0.4907359 |
cik723531 | -0.0002769 | 0.0123163 | -0.0224848 | 0.9820768 |
cik743316 | -0.0024732 | 0.0117525 | -0.2104429 | 0.8334708 |
cik743988 | 0.0052937 | 0.0115885 | 0.4568068 | 0.6481543 |
cik769397 | 0.0162905 | 0.0116651 | 1.3965126 | 0.1636355 |
cik779152 | 0.0046571 | 0.0116263 | 0.4005662 | 0.6890365 |
cik796343 | -0.0005136 | 0.0116500 | -0.0440898 | 0.9648633 |
cik798354 | 0.0077926 | 0.0116638 | 0.6681061 | 0.5046009 |
cik804328 | 0.0012401 | 0.0116892 | 0.1060883 | 0.9155861 |
cik813672 | 0.0042170 | 0.0115978 | 0.3636017 | 0.7164222 |
cik827054 | 0.0068285 | 0.0115704 | 0.5901682 | 0.5555406 |
cik849399 | -0.0005927 | 0.0115451 | -0.0513378 | 0.9590919 |
cik877890 | 0.0086757 | 0.0116911 | 0.7420753 | 0.4586465 |
cik883241 | 0.0059439 | 0.0115503 | 0.5146102 | 0.6072201 |
cik896878 | 0.0052314 | 0.0116270 | 0.4499312 | 0.6530985 |
cik1013462 | -0.0076748 | 0.0116263 | -0.6601238 | 0.5097020 |
cik1045810 | -0.0085638 | 0.0116545 | -0.7348037 | 0.4630569 |
cik1101215 | -0.0043800 | 0.0116667 | -0.3754310 | 0.7076163 |
cik1108524 | 0.0070212 | 0.0116741 | 0.6014353 | 0.5480232 |
cik1123360 | -0.0122103 | 0.0118764 | -1.0281202 | 0.3047560 |
cik1136893 | 0.0076943 | 0.0115983 | 0.6634022 | 0.5076036 |
cik1141391 | 0.0034293 | 0.0116159 | 0.2952265 | 0.7680335 |
cik1175454 | 0.0043529 | 0.0119242 | 0.3650481 | 0.7153434 |
cik1341439 | 0.0146612 | 0.0115752 | 1.2666097 | 0.2063183 |
cik1365135 | 0.0033617 | 0.0117230 | 0.2867624 | 0.7745004 |
cik1383312 | 0.0322571 | 0.0116135 | 2.7775440 | 0.0058367 |
cik1403161 | -0.0001431 | 0.0115294 | -0.0124122 | 0.9901054 |
glance(lm_model) %>% kable() %>%
kable_styling(position = "center")
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.1124705 | -0.0015524 | 0.0269775 | 0.9863856 | 0.4966375 | 37 | 735.3848 | -1392.77 | -1245.081 | 0.2096024 | 288 | 326 |
None of the topics are useful in predicting abnormal returns (the last topic's coefficient is NA because topic proportions sum to one, making it perfectly collinear with the others).
auto_stm_model <- stm(documents = out$documents,
vocab = out$vocab,
K = 0,
prevalence =~cum_abnormal_return_next_day + s(year) +
factor(cik) +
GICS_Sub_Industry +
pe_ratio +
debt_equity_ratio,
max.em.its = 150,
gamma.prior='L1',
data = out$meta,
init.type = "Spectral",
ngroups = 5)
save(auto_stm_model, file="PartC/auto_stm_model.rda")
load("PartC/auto_stm_model.rda")
tidy_summary <- data.frame(FREX = do.call(paste, c(as.data.frame(summary(auto_stm_model)$frex), sep=", ")),
Lift = do.call(paste, c(as.data.frame(summary(auto_stm_model)$lift), sep=", ")),
Score = do.call(paste, c(as.data.frame(summary(auto_stm_model)$score), sep=", ")),
Prob = do.call(paste, c(as.data.frame(summary(auto_stm_model)$prob), sep=", "))) %>%
mutate(topic = row_number()) %>%
select(topic, everything())
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
tidy_summary[,2:5] %>%
knitr::kable(booktabs = TRUE) %>%
pack_rows(index = paste("Topic", c(1:55))) %>%
kable_styling(latex_options = "scale_down")
FREX | Lift | Score | Prob |
---|---|---|---|
Topic 1 | |||
executive, disclosure, control, decision, resource, addition, accounting | accountants, control, disclosure, executive, decision, officer, resource | accountants, control, executive, decision, disclosure, officer, resource | control, disclosure, executive, decision, addition, accounting, management |
Topic 2 | |||
class, court, plaintiff, complaint, defendant, benefits, motion | benefits, preliminarily, investigative, declaratory, settled, co-defendant, unlawful | benefits, class, plaintiff, complaint, defendant, escrow, court | company, asset, class, share, common, income, note |
Topic 3 | |||
client, payroll, insurance, fund, administration, worker, associate | centralize, topics, ancillary, attendance, paid, intermediate, garnishment | client, payroll, insurance, check, worker, centralize, remittance | client, service, investment, rate, income, fund, revenue |
Topic 4 | |||
limitation, procedure, circumvention, effectiveness, inherent, possibility, constraint | circumvention, procedure, accountant, constraint, error, misstatement, possibility | circumvention, procedure, effectiveness, absolute, error, constraint, control | control, reporting, reasonable, procedure, internal, limitation, effectiveness |
Topic 5 | |||
mutual, proxy, communication, earnings, advisor, consisting, investor | closed, nontaxable, post-retirement, newsletter, registrar, contests, non-trade | proxy, mutual, earnings, closed, advisor, non-gaap, investor | revenue, increase, company, expense, income, rate, earning |
Topic 6 | |||
accompanying, merger, check, line, unfavorable, trademark, network | combat, pronouncements, lines, restrictive, commits, incoming, membership | combat, check, accompanying, merger, vertical, trademark, membership | income, credit, revenue, asset, rate, period, expense |
Topic 7 | |||
update, hardware, support, software, comparison, education, license | computationally, annualize, arranger, post-combination, middleware, shortly, codification | hardware, middleware, update, computationally, software, education, support | revenue, product, software, expense, hardware, support, rate |
Topic 8 | |||
mutual, earnings, proxy, outsourcing, distribution, communication, broker | correspondent, weights, non-compliance, securities, archival, piece, midrange | correspondent, earnings, proxy, mutual, outsourcing, broker, client | revenue, company, operation, service, income, agreement, increase |
Topic 9 | |||
description, emulation, warrant, maintenance, hardware, restructuring, conversion | cryptocurrency, convention, industrialized, product-specific, blueprint, logical, contemplation | hardware, emulation, maintenance, cryptocurrency, description, warrant, conversion | product, revenue, change, note, cost, related, asset |
Topic 10 | |||
class, pension, nominal, claim, sustainable, convert, differential | defendants, honor, panel, violate, complaints, pre-trial, omnibus | defendants, class, interchange, escrow, pension, client, litigation | company, income, asset, share, class, common, note |
Topic 11 | |||
implementation, standard, standalone, stream, practical, delivery, outsourcing | deflationary, randomly, multi-element, organizations, non-essential, invested, survey | deflationary, survey, outsourcing, processing, hardware, remittance, maintenance | revenue, service, cost, customer, related, company, product |
Topic 12 | |||
microprocessor, graphics, amendment, indenture, chipset, processor, shipment | averse, derivation, enthusiast, freedom, meaningfully, dense, semi-custom | microprocessor, graphics, wafer, dense, shipment, semi-custom, chipset | expense, related, primarily, amount, product, income, decrease |
Topic 13 | |||
system, independent, report, management, event, effective, accounting | disagreements, system, independent, public, management, report, control | system, disagreements, independent, report, public, control, internal | system, independent, management, report, future, accounting, effective |
Topic 14 | |||
distributor, signal, assembly, microcontroller, debenture, semiconductor, capacity | distributors, inappropriate, offshore, rare, interface, uncommon, signal | distributor, microcontroller, wafer, debenture, semiconductor, assembly, memory | product, distributor, approximately, income, cost, acquisition, amount |
Topic 15 | |||
insurance, client, payroll, worker, worksite, fund, administration | embezzlement, flex, worksite, usual, facing, renovation, overnight | client, insurance, embezzlement, worksite, worker, payroll, non-gaap | client, service, income, rate, investment, fund, share |
Topic 16 | |||
return, allowances, pre-tax, research, title, derivative, contingent | breakout, end-customer, exclusivity, indicators, non-warranty, differences, post-shipment | exclusivity, inventory, distributor, wafer, non-warranty, research, shipment | income, asset, revenue, expense, primarily, rate, product |
Topic 17 | |||
maintenance, perpetual, license, upfront, professional, chip, criterion | ample, hotline, drawdown, forint, fronts, mentioned, post-customer | maintenance, perpetual, upfront, hardware, shipment, chip, functionally | revenue, license, increase, customer, term, service, cost |
Topic 18 | |||
fuel, wholesale, organic, fleet, spread, macroeconomic, network | toll, undivided, acceptability, diagram, expansive, fleets, gallon | fuel, wholesale, fleet, organic, transportation, spread, gallon | revenue, income, transaction, rate, fuel, facility, impact |
Topic 19 | |||
mile, reward, label, database, private, cardholder, redemption | grocery, fashion, woman, furnishings, email, trusts, apparel | mile, reward, label, cardholder, database, breakage, sponsor | credit, increase, rate, service, revenue, expense, mile |
Topic 20 | |||
seat, geography, non-gaap, subscription, reseller, maintenance, suite | curricula, deploy, digitally, disciplinary, downloadable, educator, expositions | subscription, non-gaap, seat, maintenance, horizontal, geography, reseller | revenue, product, increase, expense, business, cost, primarily |
Topic 21 | |||
consumer, agent, transfer, location, region, rating, paper | intra-country, re-balancing, uncertainties, imposing, migrant, constructive, consumers | intra-country, consumer, agent, border, rating, transfer, region | revenue, rate, business, transaction, consumer, foreign, income |
Topic 22 | |||
non-gaap, simulation, perpetual, maintenance, operational, lease, investor | sub-section, academia, accountable, biomedical, chemical, chemistry, copyright | non-gaap, maintenance, perpetual, company, simulation, investor, matrix | company, revenue, income, expense, related, result, rate |
Topic 23 | |||
label, private, mile, reward, redemption, sponsor, collector | merchandise, harbor, catalog, label, year, collector, permission | mile, label, reward, private, collector, breakage, merchandise | credit, service, revenue, rate, mile, private, reward |
Topic 24 | |||
circuit, wireless, device, royalty, spectrum, marketable, patent | multimode, -process, circuits, codec, laptops, messaging, nonfunctional | wireless, circuit, device, multimode, licensee, royalty, marketable | revenue, related, rate, product, primarily, increase, asset |
Topic 25 | |||
communication, proxy, wealth, mutual, investor, retirement, earnings | multi-asset, borrowed, wealth, post-employment, mailing, sell, clearance | multi-asset, wealth, proxy, earnings, mutual, non-gaap, investor | company, revenue, service, management, income, activity, asset |
Topic 26 | |||
percent, subscription, professional, invoice, renewal, billing, non-gaap | co-location, contributor, motivate, multi-tenant, undeveloped, impending, parking | subscription, non-gaap, percent, absolute, invoice, multi-tenant, billing | revenue, service, expense, customer, total, percent, increase |
Topic 27 | |||
border, rebate, euro, versus, issuer, local, incentive | nonstandard, numerical, reconcile, non-european, encouraging, neutral, warning | neutral, border, rebate, euro, versus, cardholder, non-gaap | currency, expense, revenue, foreign, income, rate, customer |
Topic 28 | |||
premise, hardware, license, index, infrastructure, support, swap | non-oracle, summation, interoperable, interest, host, firmware, upward | hardware, premise, non-oracle, index, swap, marketable, deployment | revenue, license, service, expense, rate, currency, hardware |
Topic 29 | |||
debit, guarantee, check, intrusion, ticket, channel, associations | ticket, non-routine, resell, restaurants, intrusion, associations, dishonored | check, ticket, debit, intrusion, non-routine, associations, interchange | service, credit, facility, rate, income, revenue, loss |
Topic 30 | |||
banking, check, consolidation, swap, incremental, variance, processing | corrupt, disrupt, non-traditional, sophistication, steal, surviving, detrimental | check, banking, client, earnings, processing, swap, non-traditional | revenue, service, rate, operation, income, period, business |
Topic 31 | |||
comprehensive, damage, court, objection, derivative, liabilities, incentive | objection, unadjusted, argument, appellate, alleged, intra-entity, allegedly | objection, interchange, complaint, damage, plaintiff, class, court | asset, income, company, loss, share, liability, foreign |
Topic 32 | |||
overriding, internal, reporting, procedure, effectiveness, control, misstatement | overriding, inadequate, misstatement, detection, authorizations, procedure, fairly | overriding, procedure, misstatement, reporting, control, effectiveness, internal | control, internal, reporting, management, share, procedure, acquisition |
Topic 33 | |||
storage, content, server, authoritative, joint, indemnification, undelivered | controversy, authentication, liberal, outweigh, partnering, taxpaying, tolling | subscription, storage, partnering, authoritative, server, rebate, content | revenue, asset, amount, income, cost, expense, primarily |
Topic 34 | |||
profit, reserve, material, military, equivalent, research, uncertainty | amplifier, precision, revolution, smartphones, multitude, pascal, prohibitive | pascal, military, inventory, cellular, research, medical, erosion | expense, related, revenue, rate, income, result, asset |
Topic 35 | |||
desktop, investments, online, segment, staffing, unsecured, payroll | exhaustive, nimbly, patient, quicken, suspicious, employ, pre-established | desktop, patient, quicken, staffing, segment, payroll, online | revenue, income, service, business, expense, segment, total |
Topic 36 | |||
creative, developer, redundant, stable, prepayment, termination, restructuring | shippable, foregone, perpetually, hobbyist, redundant, download, localization | creative, perpetually, acrobat, redundant, developer, subscription, element | revenue, product, related, primarily, income, expense, cost |
Topic 37 | |||
digital, media, subscription, document, creative, backlog, offering | personalize, cost-sensitive, subscribe, syncing, advertiser, trajectory, photography | subscription, media, digital, creative, document, perpetual, personalize | revenue, increase, income, digital, primarily, foreign, subscription |
Topic 38 | |||
implementation, outsourced, hardware, complementary, support, electronic, element | picture, decreasing, optimistic, unwavering, non-exclusive, aircraft, bullet | hardware, outsourced, outsourcing, element, picture, installation, complementary | revenue, service, cost, customer, support, product, software |
Topic 39 | |||
hardware, update, software, premise, support, comparison, subscription | protect, -aservice, elements, instructor, perfunctory, rational, agility | hardware, subscription, update, software, premise, support, storage | software, revenue, product, hardware, support, service, expense |
Topic 40 | |||
distributor, microcontroller, assembly, half, auction, capacity, debenture | purely, serial, proposal, fail, pre-determined, non-proprietary, unrelated | distributor, microcontroller, purely, wafer, debenture, memory, inventory | product, distributor, market, income, result, investment, approximately |
Topic 41 | |||
restructure, realizable, dram, volume, decline, equipment, manufacture | re-use, restructure, forecasting, outpace, rolling, qualification, non-trade | restructure, dram, outpace, realizable, memory, qualification, inventories | product, primarily, cost, amount, income, increase, rate |
Topic 42 | |||
unsecured, contingent, special, action, employment, demand, distributor | instrumentation, roadmaps, tenor, sizing, sold, injury, predominant | distributor, inventory, predominant, roadmaps, industrial, categorization, shipment | result, income, rate, increase, product, revenue, amount |
Topic 43 | |||
consumer, transfer, negatively, agent, region, paper, strengthening | arbitrator, interconnected, multi-strategy, varied, expectancy, saving, illicit | consumer, saving, agent, strengthening, peso, pension, region | rate, revenue, currency, foreign, income, consumer, business |
Topic 44 | |||
division, input, workspace, observable, authoritative, collaboration, virtualization | duplication, login, mitigate, observation, password, shrink, turnaround | workspace, division, subscription, authoritative, desktop, maintenance, virtualization | product, revenue, service, related, primarily, asset, cost |
Topic 45 | |||
covered, retrospective, litigation, responsibility, escrow, settlement, interchange | sponsoring, covered, misstatements, responsibility, retrospective, escrow, non-controll | covered, sponsoring, responsibility, litigation, escrow, interchange, retrospective | litigation, covered, retrospective, note, responsibility, settlement, provision |
Topic 46 | |||
transition, provisional, quantitative, pandemic, adoption, distinct, form | stockholders, coronavirus, unsatisfied, pandemic, shutdown, non-distributor, capable | stockholders, provisional, pandemic, distinct, perpetual, transition, enactment | income, change, obligation, amount, rate, time, transition |
Topic 47 | |||
company, report, presentation, management, statement, event, assumption | supervision, company, presentation, reclassification, principle, public, report | supervision, company, translation, report, indefinite, audit, public | company, management, statement, report, future, acquisition, accounting |
Topic 48 | |||
notebook, architecture, processor, workstation, game, marketable, warranty | builder, custodian, motherboard, multi-core, municipality, navigation, recall | inventory, notebook, marketable, rebate, processor, graphics, visual | product, revenue, income, market, cost, related, expense |
Topic 49 | |||
processing, senior, institution, unconsolidated, subscriber, banking, electronic | acumen, distinction, sizable, accompany, convey, mild, perception | client, processing, transit, unconsolidated, subscriber, banking, thrift | revenue, service, rate, income, payment, business, expense |
Topic 50 | |||
gigabit, dram, memory, venture, flash, production, joint | underutilized, severe, tech, multi-chip, density, verdict, successively | gigabit, underutilized, dram, memory, flash, wafer, tech | product, cost, primarily, result, average, expense, acquisition |
Topic 51 | |||
assembly, microcontroller, distributor, signal, fabrication, wafer, semiconductor | uninsured, gate, virus, dispersion, expirations, wider, adopt | distributor, microcontroller, uninsured, wafer, assembly, fabrication, semiconductor | product, rate, acquisition, distributor, customer, facility, result |
Topic 52 | |||
redemption, loyalty, conduit, deposit, mile, consent, offs | unredeemed, prevailing, expiry, eliminations, restitution, non-executive, regression | mile, unredeemed, loyalty, conduit, reward, expiry, redemption | credit, increase, rate, expense, revenue, asset, program |
Topic 53 | |||
gigabit, dram, supply, flash, memory, output, production | width, creditor, gigabits, gigabit, successively, yuan, wind | gigabit, dram, memory, wafer, flash, fabrication, width | product, cost, result, primarily, agreement, amount, average |
Topic 54 | |||
comparable, debenture, broadcast, family, shipment, distributor, mainstream | foreign-currency, withhold, prom, lengthy, published, salable, wireline | debenture, distributor, broadcast, shipment, inventory, wireless, mainstream | revenue, product, income, period, rate, market, increase |
Topic 55 | |||
divestiture, unallocated, identity, protection, enterprise, billing, transition | writing, divestiture, identity, unallocated, non-operat, exceptions, varying | divestiture, writing, identity, unallocated, billing, protection, metric | revenue, primarily, income, expense, result, cost, operation |
tidy_theta <- as.data.frame(auto_stm_model$theta)
colnames(tidy_theta) <- paste0("topic_",1:55)
tidy_theta <- cbind(out$meta,tidy_theta)
topics <- paste0("topic_", 1:55)
iv_names <- paste(c(topics , "cik"), collapse = " + ")
lm_model <- lm(paste("cum_abnormal_return_next_day ~", iv_names), tidy_theta)
tidy(lm_model) %>% kable() %>%
kable_styling(position = "center")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0342238 | 0.0174691 | -1.9591011 | 0.0512492 |
topic_1 | -0.1601652 | 0.9372345 | -0.1708913 | 0.8644521 |
topic_2 | 0.0161565 | 0.0270823 | 0.5965688 | 0.5513530 |
topic_3 | 0.0283555 | 0.0201120 | 1.4098833 | 0.1598580 |
topic_4 | 0.1070051 | 0.1016611 | 1.0525676 | 0.2935890 |
topic_5 | 0.0349806 | 0.0230135 | 1.5200022 | 0.1298158 |
topic_6 | 0.0164919 | 0.0216484 | 0.7618089 | 0.4469159 |
topic_7 | 0.0260029 | 0.0236328 | 1.1002862 | 0.2723007 |
topic_8 | 0.0494177 | 0.0206808 | 2.3895485 | 0.0176355 |
topic_9 | 0.0255111 | 0.0182664 | 1.3966191 | 0.1638077 |
topic_10 | 0.0571828 | 0.0258001 | 2.2163796 | 0.0275961 |
topic_11 | 0.0660352 | 0.0213440 | 3.0938550 | 0.0022079 |
topic_12 | 0.0261464 | 0.0176575 | 1.4807478 | 0.1399745 |
topic_13 | -0.6844999 | 1.1688602 | -0.5856132 | 0.5586812 |
topic_14 | 0.0236586 | 0.0208292 | 1.1358359 | 0.2571489 |
topic_15 | 0.1090007 | 0.0244369 | 4.4605055 | 0.0000125 |
topic_16 | 0.0316088 | 0.0186996 | 1.6903499 | 0.0922486 |
topic_17 | 0.0282858 | 0.0173990 | 1.6257179 | 0.1053112 |
topic_18 | 0.0365042 | 0.0174952 | 2.0865307 | 0.0379777 |
topic_19 | -0.0001363 | 0.0226825 | -0.0060069 | 0.9952121 |
topic_20 | 0.0376618 | 0.0174977 | 2.1523796 | 0.0323567 |
topic_21 | 0.0335991 | 0.0217613 | 1.5439838 | 0.1238989 |
topic_22 | 0.0380281 | 0.0173324 | 2.1940433 | 0.0291836 |
topic_23 | 0.0168625 | 0.0234769 | 0.7182595 | 0.4732902 |
topic_24 | 0.0348916 | 0.0169743 | 2.0555549 | 0.0408986 |
topic_25 | 0.0273443 | 0.0253257 | 1.0797059 | 0.2813481 |
topic_26 | 0.0312590 | 0.0176186 | 1.7742063 | 0.0772863 |
topic_27 | 0.0209410 | 0.0184292 | 1.1362966 | 0.2569564 |
topic_28 | 0.0677562 | 0.0222270 | 3.0483765 | 0.0025563 |
topic_29 | 0.0219950 | 0.0204805 | 1.0739480 | 0.2839157 |
topic_30 | 0.0345708 | 0.0176023 | 1.9639920 | 0.0506757 |
topic_31 | -0.0021999 | 0.0335033 | -0.0656622 | 0.9477010 |
topic_32 | 0.0047848 | 0.1118821 | 0.0427665 | 0.9659229 |
topic_33 | 0.0281372 | 0.0189613 | 1.4839287 | 0.1391291 |
topic_34 | 0.0319834 | 0.0173598 | 1.8423853 | 0.0666421 |
topic_35 | 0.0279150 | 0.0171583 | 1.6269065 | 0.1050583 |
topic_36 | 0.0127701 | 0.0201107 | 0.6349897 | 0.5260350 |
topic_37 | 0.0346392 | 0.0207871 | 1.6663786 | 0.0969318 |
topic_38 | 0.0388209 | 0.0189901 | 2.0442700 | 0.0420094 |
topic_39 | 0.0365428 | 0.0265403 | 1.3768774 | 0.1698225 |
topic_40 | 0.0172476 | 0.0261905 | 0.6585459 | 0.5108135 |
topic_41 | 0.0377398 | 0.0203535 | 1.8542170 | 0.0649248 |
topic_42 | 0.0248161 | 0.0180613 | 1.3739901 | 0.1707160 |
topic_43 | 0.0331811 | 0.0189071 | 1.7549500 | 0.0805333 |
topic_44 | 0.0364368 | 0.0178258 | 2.0440551 | 0.0420308 |
topic_45 | 0.0994105 | 0.1427843 | 0.6962284 | 0.4869539 |
topic_46 | 0.0250350 | 0.0417817 | 0.5991848 | 0.5496102 |
topic_47 | 0.0684657 | 0.2844716 | 0.2406769 | 0.8100093 |
topic_48 | 0.0459670 | 0.0179390 | 2.5624016 | 0.0110015 |
topic_49 | 0.0281820 | 0.0176058 | 1.6007185 | 0.1107439 |
topic_50 | 0.0133341 | 0.0245449 | 0.5432550 | 0.5874543 |
topic_51 | 0.0768139 | 0.0237003 | 3.2410519 | 0.0013582 |
topic_52 | 0.0497146 | 0.0219592 | 2.2639504 | 0.0244631 |
topic_53 | 0.0307229 | 0.0228948 | 1.3419172 | 0.1808806 |
topic_54 | 0.0345447 | 0.0173587 | 1.9900518 | 0.0477105 |
topic_55 | NA | NA | NA | NA |
cik4127 | 0.0050528 | 0.0125512 | 0.4025697 | 0.6876202 |
cik6281 | 0.0047687 | 0.0124193 | 0.3839731 | 0.7013355 |
cik723125 | 0.0074893 | 0.0122164 | 0.6130522 | 0.5404175 |
cik723531 | 0.0044749 | 0.0134251 | 0.3333207 | 0.7391808 |
cik743316 | 0.0028395 | 0.0122850 | 0.2311381 | 0.8174028 |
cik743988 | 0.0033442 | 0.0124527 | 0.2685524 | 0.7885029 |
cik769397 | 0.0181427 | 0.0121326 | 1.4953623 | 0.1361228 |
cik779152 | 0.0058436 | 0.0122335 | 0.4776735 | 0.6333138 |
cik796343 | 0.0007463 | 0.0121797 | 0.0612720 | 0.9511932 |
cik798354 | 0.0107927 | 0.0120591 | 0.8949898 | 0.3716818 |
cik804328 | -0.0082341 | 0.0121683 | -0.6766851 | 0.4992520 |
cik813672 | 0.0012690 | 0.0121904 | 0.1041005 | 0.9171758 |
cik827054 | 0.0060625 | 0.0123319 | 0.4916091 | 0.6234413 |
cik849399 | 0.0038317 | 0.0125188 | 0.3060772 | 0.7598090 |
cik877890 | -0.0045308 | 0.0126916 | -0.3569895 | 0.7214108 |
cik883241 | 0.0046864 | 0.0123963 | 0.3780467 | 0.7057273 |
cik896878 | 0.0062140 | 0.0123736 | 0.5022019 | 0.6159821 |
cik1013462 | -0.0010103 | 0.0125133 | -0.0807342 | 0.9357201 |
cik1045810 | -0.0091426 | 0.0123840 | -0.7382571 | 0.4610735 |
cik1101215 | -0.0098667 | 0.0122508 | -0.8053917 | 0.4213843 |
cik1108524 | 0.0092281 | 0.0121335 | 0.7605490 | 0.4476668 |
cik1123360 | -0.0153089 | 0.0125179 | -1.2229583 | 0.2225350 |
cik1136893 | 0.0044118 | 0.0124185 | 0.3552601 | 0.7227042 |
cik1141391 | -0.0040708 | 0.0120817 | -0.3369387 | 0.7364551 |
cik1175454 | 0.0057457 | 0.0126414 | 0.4545141 | 0.6498662 |
cik1341439 | 0.0153888 | 0.0126262 | 1.2187986 | 0.2241074 |
cik1365135 | 0.0063853 | 0.0124522 | 0.5127884 | 0.6085671 |
cik1383312 | 0.0302267 | 0.0128217 | 2.3574562 | 0.0191976 |
cik1403161 | -0.0006274 | 0.0123547 | -0.0507804 | 0.9595424 |
glance(lm_model) %>% kable() %>%
kable_styling(position = "center")
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.276611 | 0.0285065 | 0.0265696 | 1.114897 | 0.2616751 | 83 | 768.7174 | -1367.435 | -1045.549 | 0.1708383 | 242 | 326 |
The management discussion section was successfully extracted from the downloaded 10-K reports for the 30 companies. The corpus consists of 326 reports out of a possible 330, spanning 2010-2020. Reports were cleaned using rvest and regex. Parsing errors were removed using the hunspell package, frequency filtering and token-length filtering. Stopwords were removed using a number of dictionaries, including finance-specific ones. The text was POS tagged, and nouns, adverbs and adjectives were kept for further analysis.
Event study methodology was used to calculate abnormal returns, abnormal volume and the variance of abnormal returns. Cumulative abnormal returns, average abnormal returns, cumulative abnormal volume and average abnormal volume over 2-day and 5-day windows were used as target variables, along with the variance of abnormal returns over the 5-day window. To create a baseline model, fundamental and technical indicators were used. To test whether sentiment has an effect on prices and returns, multiple finance-specific and non-finance-specific dictionaries were evaluated. The change in sentiment from the previous year's report was also tested as a predictor. In total 572 models were fitted.
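As a recap of the notation, following the standard event-study formulation in MacKinlay (1997); the window endpoints below are generic placeholders rather than the exact windows coded earlier:

$$AR_{it} = R_{it} - E[R_{it} \mid X_t], \qquad CAR_i(t_1, t_2) = \sum_{t=t_1}^{t_2} AR_{it}, \qquad \overline{AR}_i(t_1, t_2) = \frac{CAR_i(t_1, t_2)}{t_2 - t_1 + 1}$$

where $R_{it}$ is the realised return of firm $i$ on day $t$ and $E[R_{it} \mid X_t]$ is the expected (normal) return implied by the benchmark model; abnormal volume and the variance of abnormal returns are defined analogously over the same windows.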
Although sentiment scores appeared statistically significant on a number of occasions and the adjusted R squared improved, the results were not consistent: p-values were generally large and the sentiment coefficients often changed sign. In terms of regression fit, the best models for most of the dependent variables were the baseline models consisting of fundamental and technical indicators, and this inconsistency became even more apparent after cross-validation. Under cross-validation the best performing models were the ones that used the sentimentr algorithm combined with the LM dictionary. This is encouraging as it is consistent with theory: a finance-specific dictionary combined with an algorithm that accounts for negation should produce the most accurate scores. However, despite this cross-validation performance, the sentimentr-LM results are not statistically significant in predicting abnormal returns. Moreover, even the “custom” dictionary fitted with returns as the response variable performs poorly on cross-validation. Therefore, there is little evidence that sentiment in the management section of 10-K reports influences prices, and even less that it can be used to make decisions such as building trading strategies. Nonetheless, it is important to note that this study only uses 326 reports from 30 companies in total; including all major companies over a longer time period might yield different results. Furthermore, including other sections of the reports, such as Item 1 “Business” and Item 1A “Risk Factors”, may be a fruitful avenue for future research, as doing so would increase text size and enable the algorithms to capture a larger proportion of the sentiment.
Topic modelling was conducted using both a supervised and an unsupervised stm approach. To select K, values from 4 to 60 were explored; based on semantic coherence, the optimal K was chosen to be 9. Company fundamentals and industry features, such as the P/E ratio and the debt-to-equity ratio, were used as prevalence covariates. The unsupervised approach resulted in 55 topics. Unfortunately, neither approach yielded topics with a statistically significant effect on cumulative abnormal returns.
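A sketch of this stm workflow follows; the metadata column names (pe_ratio, debt_to_equity) and the car_5 outcome are placeholders standing in for the fundamentals actually used.
library(stm)
processed <- textProcessor(reports$text, metadata = reports_meta) # reports_meta holds fundamentals per report
prepped <- prepDocuments(processed$documents, processed$vocab, processed$meta)
# Supervised choice of K: explore K = 4..60 and inspect semantic coherence (Mimno et al., 2011)
k_search <- searchK(prepped$documents, prepped$vocab, K = 4:60,
                    prevalence = ~ pe_ratio + debt_to_equity, data = prepped$meta)
plot(k_search) # semantic coherence pointed to K = 9
# Fit the chosen model; setting K = 0 with spectral initialisation instead lets stm pick K (the unsupervised run)
fit_stm <- stm(prepped$documents, prepped$vocab, K = 9,
               prevalence = ~ pe_ratio + debt_to_equity, data = prepped$meta)
# Link document-topic proportions to cumulative abnormal returns (theta rows sum to 1, so drop one topic)
summary(lm(prepped$meta$car_5 ~ fit_stm$theta[, -1]))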
Feldman, R., Govindaraj, S., Livnat, J., & Segal, B. (2010). Management’s tone change, post earnings announcement drift and accruals. Review of Accounting Studies, 15(4), 915-953.
MacKinlay, A. C. (1997). Event studies in economics and finance. Journal of economic literature, 35(1), 13-39.
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing (pp. 262-272).
Mourtgos, S., & Adams, I. (2019). The rhetoric of de-policing: Evaluating open-ended survey responses from police officers with machine learning-based structural topic modeling. Journal of Criminal Justice.
Stewart, B. M. (2020). Comment on GitHub issue #212, “Non-Atomic Vectors as Metadata?” [Online]. Comment posted on 9 Feb 2020. Available from: https://github.com/bstewart/stm/issues/212
Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064-1082. doi: 10.1111/ajps.12103
Yadav, P. K. (1992). Event studies based on volatility of returns and trading volume: A review. The British Accounting Review, 24(2), 157-184.
Yan, Q. (2020). Notes for “Text Mining with R: A Tidy Approach” [ebook]. Available at: https://bookdown.org/Maxine/tidy-text-mining/
Hadley Wickham (2020). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.6. https://CRAN.R-project.org/package=rvest
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/
Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS, 1(3). doi: 10.21105/joss.00037
Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
Stefan Milton Bache and Hadley Wickham (2020). magrittr: A Forward-Pipe Operator for R. R package version 2.0.1. https://CRAN.R-project.org/package=magrittr
Hadley Wickham (2020). httr: Tools for Working with URLs and HTTP. R package version 1.4.2. https://CRAN.R-project.org/package=httr
Mario Annau (2015). tm.plugin.webmining: Retrieve Structured, Textual Data from Various Web Sources. R package version 1.3. https://CRAN.R-project.org/package=tm.plugin.webmining
Microsoft and Steve Weston (2020). foreach: Provides Foreach Looping Construct. R package version 1.5.1. https://CRAN.R-project.org/package=foreach
Microsoft Corporation and Steve Weston (2020). doParallel: Foreach Parallel Adaptor for the ‘parallel’ Package. R package version 1.0.16. https://CRAN.R-project.org/package=doParallel
Rinker, T. W. (2018). lexicon: Lexicon Data version 1.2.1. http://github.com/trinker/lexicon
Jeffrey A. Ryan and Joshua M. Ulrich (2020). quantmod: Quantitative Financial Modelling Framework. R package version 0.4.18. https://CRAN.R-project.org/package=quantmod
Jan Wijffels (2020). udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ‘UDPipe’ ‘NLP’ Toolkit. R package version 0.8.5. https://CRAN.R-project.org/package=udpipe
Wilson Freitas (2021). bizdays: Business Days Calculations and Utilities. R package version 1.0.8. https://CRAN.R-project.org/package=bizdays
Alboukadel Kassambara (2019). ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’. R package version 0.1.3. https://CRAN.R-project.org/package=ggcorrplot
Joshua Ulrich (2020). TTR: Technical Trading Rules. R package version 0.24.2. https://CRAN.R-project.org/package=TTR
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Roberts ME, Stewart BM, Tingley D (2019). “stm: An R Package for Structural Topic Models.” Journal of Statistical Software, 91(2), 1-40. doi: 10.18637/jss.v091.i02
Ian Fellows (2018). wordcloud: Word Clouds. R package version 2.6. https://CRAN.R-project.org/package=wordcloud
Nicolas Proellochs and Stefan Feuerriegel (2021). SentimentAnalysis: Dictionary-Based Sentiment Analysis. R package version 1.3-4. https://CRAN.R-project.org/package=SentimentAnalysis
Posthuma Partners (2019). lmvar: Linear Regression with Non-Constant Variances. R package version 1.5.2. https://CRAN.R-project.org/package=lmvar
Jeroen Ooms (2021). magick: Advanced Graphics and Image-Processing in R. R package version 2.7.2. https://CRAN.R-project.org/package=magick
Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes
Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra