Module Title: Text Analytics

Module Code: “IB9CW0”

Year of Module: “2021”

Number of Pages: 131


Abstract

In this work, textual information from the Management Discussion section of 10-K reports is used to derive useful insights from a business standpoint. These reports are publicly available through the EDGAR platform on the SEC website. An NLP pipeline is then constructed to clean and process the text for further analysis.

In Part A a corpus is constructed by scraping the index of filings from the SEC website, downloading the reports and extracting the relevant section, Item 7. Next, important keywords are examined which can provide value from either an analytical or a business point of view. To this end, TF-IDF analysis is performed on unigrams, bigrams and trigrams.

In Part B the goal is to link the sentiment of the text to different financial indicators, namely abnormal returns, abnormal volume and the volatility of abnormal returns. First, prices, fundamental indicators and holiday dates are fetched to enable the calculation of abnormal returns, volume and variance. In addition to the fundamental indicators, a number of technical indicators are calculated to build a more robust baseline model. Panel data analysis is then performed.

In Part C topic modelling is performed in an attempt to identify which topics are commonplace in 10-K reports. Topic modelling is a useful technique because it can summarize large volumes of text almost instantly. For example, on the date of a report filing, an analyst could use this algorithm to summarize the report in a matter of seconds and link it to previous reports. Furthermore, topics can be linked to returns and other financial indicators.


Libraries

library(edgar)
library(rvest)
library(tidyverse)
library(lubridate)
library(tidytext)
library(stringr)
library(magrittr) # for Tee pipe
library(httr) # scrape with headers
library(htmltidy) # clean broken html
library(tm.plugin.webmining) # remove html method 2
library(foreach)
library(doParallel)
library(lexicon) # stopwords
library(quantmod)
library(udpipe) # text annotation 
library(lubridate) # date manipulation
library(bizdays) # business days manipulation
library(ggcorrplot) #plot correlation matrix
library(purrr) #map2
library(TTR) # TA indicators
library(splines) # for stm temporal
library(stm)
library(wordcloud) 
library(SentimentAnalysis)
library(lmvar) # cross validation
library(magick)
library(cowplot)
library(svglite)
library(ggthemes)
library(kableExtra)
library(sjPlot)

We assume that the default action is not to buy stocks based on the sentiment of 10-K reports, and the corresponding null hypothesis is that stock movements are not related to the sentiment of 10-K reports.

Part A

Portfolio Construction

Sector Selection

portfolio <- read_csv("portfolio-data.csv")
names(portfolio) <- stringr::str_replace_all(names(portfolio), " ", "_")
portfolio <- portfolio %>% rename(cik = CIK)

portfolio %>%
  mutate(GICS_Sub_Industry= as.factor(GICS_Sub_Industry)) %>% 
  group_by(GICS_Sub_Industry) %>% 
  count(sort = T) %>% 
  knitr::kable(caption = 'Portfolio option summary') %>% 
  kable_styling(position = "center")
Portfolio option summary
GICS_Sub_Industry n
Semiconductors 13
Data Processing & Outsourced Services 12
Application Software 11
Technology Hardware, Storage & Peripherals 7
IT Consulting & Other Services 6
Communications Equipment 5
Electronic Equipment & Instruments 3
Internet Services & Infrastructure 3
Semiconductor Equipment 3
Systems Software 3
Electronic Components 2
Electronic Manufacturing Services 2
Technology Distributors 1

The top 3 industries have over 10 companies each. A portfolio is built from companies in these industries. The main reason for this approach is sample maximization. Primarily, this study aims to uncover a link between price and text. Secondarily, the goal is to examine differences across industries. To achieve the latter goal, industries must be well represented in the sample. If differences in price dynamics are detected across these industries, further investigation can be made in industries that are less well represented. To summarize, the main motivation behind this portfolio choice is to maximize the statistical power of the study and the generalisability of the results.

industries <- c("Semiconductors", 
                "Data Processing & Outsourced Services", 
                "Application Software")

candidate_ciks <- portfolio %>% 
  filter(GICS_Sub_Industry %in% industries) %>% 
  pull(cik)

portfolio <- portfolio %>% 
  filter(cik %in% candidate_ciks)

rm(candidate_ciks, industries)

During the first attempt at extraction, the edgar package was used. For the top 10 companies within each category, 255 links were acquired from the master index through the edgar::getMasterIndex(2010:2020) call. In the best case scenario, only 70% of the records would be present. Furthermore, once the same package was used to extract the management discussion, a further 58 records were lost, leaving only 197 records for further analysis. This is 54% of the theoretical total. Even with sophisticated replacement strategies, such as extracting text from other sections or using 10-Qs instead of 10-Ks, this is inadequate. Therefore, custom scraping and section extraction algorithms are developed.

The results from scraping and parsing the daily index from the SEC website are significantly better than the ones provided by the edgar package. Overall we get 10,991,106 records vs 7,942,110 records from the package, with roughly 20,000 more 10-K reports. Indeed, for the selected industries all reports are available. A number of companies have fewer than 11 reports, but closer inspection shows that this is a result of name changes and structural changes of the company itself rather than a fault in the scraping procedure. Hence these companies are omitted as per the original strategy, leaving 10 companies for each selected industry with one report corresponding to each year.

Another advantage of focusing on just a few sectors is that it reduces the variation in word use, which hopefully leads to more consistent sentiment scores for firms in the portfolio. Words are likely to have different meanings across companies and sectors, which affects sentiment analysis negatively. By concentrating on a few industries this variation is minimized.

Get the Data

Fetch Master Index Files

Utility function to help combine URLs.

combURL <- function(base, addons, type="") {
  for (addon in addons) {
    base <- paste(base, addon, sep = "/")
  }
  return(paste0(base, type))
}

This operation cannot be parallelized or otherwise sped up as we need to stay under the SEC limit of 10 calls per second. Additionally, Sys.sleep(0.1) is used to slow down the script. Since parsing might take a long time, index files are saved at this stage and parsed later in parallel.

domain <- "https://www.sec.gov/Archives/edgar/daily-index"

if(!dir.exists("master_idx")){
  dir.create("master_idx")
}


for (year in 2010:2020) {
  for (i in 1:4) {
    qt <- paste0("QTR", i)
    url <- combURL(domain, c(year, qt, "index.json" ))
    Sys.sleep(0.1)
    GET(url, user_agent("Mozilla/5.0"), write_disk("temp.json"))
    file <- jsonlite::fromJSON("temp.json")
    for (link in file$directory$item$name) {
      if (str_detect(link, "master")) {
        url <- combURL(domain, c(year, qt, link))
        if(!dir.exists(combURL("master_idx", c(year)))) {
          dir.create(combURL("master_idx", c(year)))
        }
        if(!dir.exists(combURL("master_idx", c(year, qt)))) {
          dir.create(combURL("master_idx", c(year, qt)))
        }
        filename <- combURL("master_idx", c(year, qt, link))
        GET(url, user_agent("Mozilla/5.0"), write_disk(filename))
        Sys.sleep(0.1)
      }
    }
    file.remove("temp.json")
  }
}

Parse Master Index Files

cluster <- NULL

#utility function to register cluster.
register_cores <- function() {
  n_cores <- parallel::detectCores() - 12
  cluster <<- parallel::makeCluster(n_cores, type = "PSOCK")
  doParallel::registerDoParallel(cl = cluster)
  foreach::getDoParRegistered()
}

Parse idx files into a data frame for each year and quarter.

if(!dir.exists("master_indexes")){
  dir.create("master_indexes")
}

#Setup for parallel computing
register_cores()

#Use "foreach" loop from the foreach package which supports parrallel operations. 
for (year in 2010:2020) {
  year_master <- foreach(
    q = 1:4,
    .combine=rbind,
    .packages=c('tidyverse', 'stringr')
  ) %dopar% {
    qt <- paste0("QTR", q)
    url <- combURL("master_idx", c(year, qt))
    files <- list.files(url)
    q_master <- data.frame()
    for (file in files) {
      filename <- combURL("master_idx", c(year, qt, file))
      file <- readLines(filename)
      # split lines
      file <- str_split(file, '  ')
      # trim heading
      file <- file[8:length(file)]
      file_df <- data.frame()
      for (i in 1:length(file)) {
        # split into columns
        l <- str_split(file[[i]][1], '\\|')
        # convert to df
        df <- data.frame(cik=l[[1]][1], 
                         name=l[[1]][2], 
                         form_type=l[[1]][3], 
                         date=l[[1]][4], 
                         link=l[[1]][5], 
                         qtr = q)
        file_df <- rbind(file_df, df)
      }
      q_master <- rbind(q_master, file_df)
    }
    return(q_master)
  }
  # save result
  filename <- combURL("master_indexes", c(year), type = "_year_master.rda")
  save(year_master, file =  filename)
  rm(year_master)
  gc()
}

rm(q_master, df, file_df, l)

#close cluster
parallel::stopCluster(cl = cluster)

Combine all saved year master indexes into one data frame.

master_indexes <- list.files("master_indexes/",pattern="rda")
all_my_indexes <- data.frame()

for(master_index in master_indexes){
  load(paste0("master_indexes/",master_index))
  this_index <- year_master
  all_my_indexes <- bind_rows(all_my_indexes,this_index)
  print(master_index)
}
all_my_indexes <- all_my_indexes[-c(1:11),]

rm(this_index)

Download Files for Selected Industries

# update master index
all_my_indexes <- all_my_indexes %>% 
  filter(form_type == "10-K") %>%
  filter(cik %in% portfolio$cik)
domain <- "https://www.sec.gov/Archives/"

if(!dir.exists("full_text")) {
  dir.create("full_text")
}

for (i in 1:length(all_my_indexes$cik)) {
  row = all_my_indexes[i,]
  url <- paste0(domain, row$link)
  print(url)
  Sys.sleep(0.1)
  dirname <- paste0("full_text/", row$cik)
  dirname <- paste0(dirname, "/")
  print(dirname)
  if(!dir.exists(dirname)){
    dir.create(dirname)
  }
  filename <- paste0(paste0(dirname,row$date),".txt")
  print(filename)
  if(!file.exists(filename)) {
    print(filename)
    GET(url, user_agent("Mozilla/5.0"), write_disk(filename))
  }
}

rm(row, dirname, filename, url)

All 369 files downloaded successfully.

Regex Manual Extraction

# clean document titles
# clean item tags
cleanDocTitle <- function(text) {
  text <- str_replace(text, 'm&nbsp;', 'm')
  text <- str_replace(text, '>&nbsp;', ' ')
  text <- str_replace(text, '<[\\s\\S]*>', ' ')
  text <- str_replace_all(text, '\n', ' ')
  text <- str_replace_all(text, '"', ' ')
  text <- str_replace_all(text, '&#160;', ' ')
  text <- str_replace_all(text, '&nbsp;', ' ')
  text <- str_replace_all(text, ' ', '')
  text <- str_replace_all(text, '\\.', ' ')
  text <- str_replace_all(text, '>', ' ')
  text <- trimws(text)
  text <- tolower(text)
  return (text)
}

# remove html from text using rvest
strip_html <- function(text) {
  if (!is.na(text)) {
    if (text!= "") {
      tryCatch( {
        text <- html_text(read_html(text))
      }, error=function(cond) {
        text <- extractHTMLStrip(text)
      })
    }
  }
  return (text)
}
dirs <- list.dirs("full_text") 
master <- data.frame()

getSections <- function(regex, text) {
  start_end_section <- stringr::str_locate_all(text, regex)
  start_end_section <- as.data.frame(start_end_section)
  return(start_end_section)
}


for (dir in dirs[2:length(dirs)]) {
  files <- list.files(dir)
  for (file in files) {
    date <- file
    date <- str_remove(date, ".txt")
    cik <- str_remove(dir, "full_text/")
    file <- paste(dir, file, sep = "/" )
    text <-  read_file(file)
    
    doc_start <- as.data.frame(stringr::str_locate_all(text, "<DOCUMENT>"))
    doc_end <- as.data.frame(stringr::str_locate_all(text, "</DOCUMENT>"))
    type <- as.data.frame(stringr::str_locate_all(text, '<TYPE>[^\n]+'))
    
    for (i in 1:length(doc_start$start)) {
      doc <- substr(text, type$start[i],  type$end[i])
      if (str_detect(doc, "10-K")) {
        regex <- '(>|&nbsp;\\s|>&#160;|>&nbsp;)(Item|ITEM|Ite|It|"Item")(\\s|&#160;|&nbsp;|<a name="(1A|1B|7A|7|8|9|9A)">\\s|<.*?>m&nbsp;|\\s<.*?>)(1A|1B|7A|7|8|9|9A)\\.{0,1}'
        start_end_section <- getSections(regex, text=text)
        item7 <- NA
        temp_df <- NA
        tryCatch(
          {
            
            start_end_section$item  <- cleanDocTitle(substring(text, 
                                                               first=start_end_section$start, 
                                                               last= start_end_section$end))
            
            # select item 9 or 9a
            is_item_9_detected = (start_end_section %>% 
                                    filter(item == "item9") %>% 
                                    count() %>% pull(n)) > 0
            item9_lable <- ifelse(is_item_9_detected, "item9", "item9a")
            
            # select top item 9 or 9a start index
            top_item <- start_end_section %>% 
              filter(item == item9_lable) %>% 
              arrange(desc(start)) %>% 
              slice(1) %>% 
              pull(start)
            
            if (!is.na(top_item)) {
              # use top item 9 as upper bound
              start_end_section <- start_end_section %>% 
                filter(start < top_item) %>%  
                filter(!(item %in% c("item9", "item9a")))
            } 
            
            
            # top item from each item group
            start_end_section <- start_end_section %>% 
              group_by(item) %>% 
              arrange(desc(start)) %>% 
              slice(1) %>%  
              ungroup()
            
            # select item 8 or 7a
            is_item_8_detected = (start_end_section %>% 
                                    filter(item == "item8") %>% 
                                    count() %>% pull(n)) == 1
            end_lable <- ifelse(is_item_8_detected, "item8", "item7a")
            
            row_index_item8 <- start_end_section %>% 
              filter(item == end_lable) %>% 
              pull(start)
            
            # select item 7 or 7a
            is_item_7_detected = (start_end_section  %>% 
                                    filter(item == "item7") %>%
                                    count() %>% pull(n)) == 1
            
            start_lable <- ifelse(is_item_7_detected, "item7", "item7a")
            
            row_index_item7 <- start_end_section %>% 
              filter(item == start_lable) %>% 
              pull(start)
            
            # use item7a if item 7 is found after item 8. Preserves 3 reports.
            if (row_index_item7 > row_index_item8) {
              row_index_item7 <- start_end_section %>% 
                filter(item == "item7a") %>%
                pull(start)
            }
            
            
            item7 <- substr(text, start = row_index_item7, stop = row_index_item8)
            item7 <- strip_html(item7)
          },
          error=function(cond) {
            print(cik)
            print(date)
            message("Error message:")
            message(cond)
            return(NA)
          })
        if (!is.na(item7)) {
          if (item7 == "") {
            print(start_end_section)
            print(cik)
            print(date)
          }
        }
        temp_df <- data.frame(cik=cik, date=date, text=item7)
        master <- rbind(master, temp_df)
      }
    }
  }
}

rm(temp_df, file, item7, lable, row_index_item7, row_index_item8, start_end_section)

One report failed to parse. Investigation shows that there is actually no management discussion in that report; instead, there is a reference to another report.

master %>% 
  mutate(text_size = nchar(text)) %$% 
  hist(text_size, breaks = 20, xlim = c(0, 1000000))

The distribution of text size helps detect further errors. A small text size is a sign that problems might have occurred. From this sample it seems anything below 2,000 characters is unlikely to contain relevant text.

  • TEXAS INSTRUMENTS INC (CIK: 97476) report consists entirely of references to annual shareholder letters.
  • INTEL (CIK: 50863) is an outlier in their formatting practice. They have almost entirely forgone the traditional approach to 10-K styling.

Both of these companies are eliminated from the portfolio.

Select Companies

# filter by available reports
portfolio_ciks <- master %>% 
  mutate(text_size = nchar(text)) %>% 
  filter(!(cik %in% c(97476, 50863))) %>% 
  filter(!(is.na(text) | text ==  "")) %>% 
  filter(text_size > 2000) %>% 
  unique() %>% 
  mutate(cik = as.numeric(cik)) %>% 
  inner_join(portfolio) %>% 
  count(GICS_Sub_Industry, cik) %>% 
  group_by(GICS_Sub_Industry) %>% 
  arrange(desc(n), .by_group = TRUE) %>% 
  filter(n > 7) %>% 
  pull(cik)
## Joining, by = c("cik", "Symbol", "Security", "GICS_Sector", "GICS_Sub_Industry")
# update portfolio
portfolio <- portfolio %>% 
  filter(cik %in% portfolio_ciks)

# update parsed text df
master <- master %>% 
  unique() %>%
  filter(!(is.na(text) | text ==  "")) %>%
  mutate(text_size = nchar(text)) %>%
  filter(text_size > 2000) %>%
  mutate(date = as.Date(date, format =  "%Y%m%d")) %>% 
  mutate(cik = as.numeric(cik)) %>%
  filter(cik %in% portfolio_ciks) %>% 
  inner_join(portfolio) %>% 
  mutate(doc_id = row_number())
## Joining, by = c("cik", "Symbol", "Security", "GICS_Sector", "GICS_Sub_Industry")
rm(portfolio_ciks)
#saveRDS(master, "master.rds")

Final Portfolio

portfolio$GICS_Sector <- NULL
knitr::kable(portfolio, caption = "Portfolio of Selected Companies")
Portfolio of Selected Companies
Symbol Security GICS_Sub_Industry cik
ADBE Adobe Systems Inc Application Software 796343
AMD Advanced Micro Devices Inc Semiconductors 2488
ADS Alliance Data Systems Data Processing & Outsourced Services 1101215
ADI Analog Devices, Inc.  Semiconductors 6281
ANSS ANSYS Application Software 1013462
ADSK Autodesk Inc.  Application Software 769397
BR Broadridge Financial Solutions Data Processing & Outsourced Services 1383312
CDNS Cadence Design Systems Application Software 813672
CTXS Citrix Systems Application Software 877890
FIS Fidelity National Information Services Data Processing & Outsourced Services 1136893
FISV Fiserv Inc Data Processing & Outsourced Services 798354
FLT FleetCor Technologies Inc Data Processing & Outsourced Services 1175454
GPN Global Payments Inc.  Data Processing & Outsourced Services 1123360
INTU Intuit Inc.  Application Software 896878
JKHY Jack Henry & Associates Data Processing & Outsourced Services 779152
MA Mastercard Inc.  Data Processing & Outsourced Services 1141391
MXIM Maxim Integrated Products Inc Semiconductors 743316
MCHP Microchip Technology Semiconductors 827054
MU Micron Technology Semiconductors 723125
NLOK NortonLifeLock Application Software 849399
NVDA Nvidia Corporation Semiconductors 1045810
ORCL Oracle Corp.  Application Software 1341439
PAYX Paychex Inc.  Data Processing & Outsourced Services 723531
QCOM QUALCOMM Inc.  Semiconductors 804328
CRM Salesforce.com Application Software 1108524
SWKS Skyworks Solutions Semiconductors 4127
SNPS Synopsys Inc.  Application Software 883241
V Visa Inc.  Data Processing & Outsourced Services 1403161
WU Western Union Co Data Processing & Outsourced Services 1365135
XLNX Xilinx Semiconductors 743988

NLP pipeline

Text Normalisation

HTML was already stripped during parsing; now further cleaning needs to be done to remove digits and symbols. At this stage two cleaning functions are defined: one removes punctuation completely, the other attempts to remove extra punctuation, mainly table leftovers, while preserving the sentence structure. Whilst punctuation is not necessary in most cases, the sentimentr package used in Part B relies on punctuation to identify inflection and analyses sentiment at the sentence level. However, due to the presence of tables in this dataset, removing punctuation accurately is challenging. Hence, the data is kept in two formats.

clean_text_retain_puntuation <- function(text) {
  #we trim the text to remove section title.
  text <- stringr::str_sub(text,start =  94,end = -1)
  #unescape unicode
  text <- stringi::stri_unescape_unicode(text)
  #sets all chars to unicode
  text <- iconv(text, "ASCII",  sub = " ")
  #removes line breaks
  text <- stringr::str_replace_all(text, "\r?\n|\r|\t", " ")
  text <- stringr::str_replace_all(text, "[[digit:]]$", " ")
  #removes digits
  text <- tm::removeNumbers(text)
  #remove $ and % sign
  text <- stringr::str_replace_all(text, "\\$|\\$(\\.)", " ")
  text <- stringr::str_replace_all(text, "%", " ")
  text <- qdap::bracketX(text)
  text <- str_squish(text)
  #remove repeating characters
  text <- stringr::str_replace_all(text, '([?@])\\1+', " ") 
  text <- stringr::str_replace_all(text, '\\,\\.|\\.\\,', " ") 
  return (text)
}
text_with_punctuation <- parallel::mclapply(master$text, clean_text_retain_puntuation)
text_with_punctuation <- unlist(text_with_punctuation)
saveRDS(text_with_punctuation, "text_with_punctuation.rds")
clean_text <- function(text) {
  #we trim the text to remove section title.
  text <- stringr::str_sub(text,start =  94,end = -1)
  #unescape unicode
  text <- stringi::stri_unescape_unicode(text)
  #sets all chars to unicode
  text <- iconv(text, "ASCII",  sub = " ")
  #removes line breaks
  text <- stringr::str_replace_all(text, "\r?\n|\r|\t", " ")
  #removes digits
  text <- tm::removeNumbers(text)
  #remove non word breaking punctuation
  text <- tm::removePunctuation(text,
                                preserve_intra_word_contractions = T,
                                preserve_intra_word_dashes = T)
  return (str_squish(text))
}
cleaned <- parallel::mclapply(master$text, clean_text )
master$text <- unlist(cleaned)
rm(cleaned)
saveRDS(master, "master_cleaned.rds")

POS Tagging

Next, Part-of-Speech tagging is conducted. Importantly, this is done before stopword removal, as common stopwords provide important grammatical information which helps the tagger distinguish between nouns and verbs. In other words, POS tagging looks at the sequence as a whole.

langmodel <- udpipe::udpipe_download_model("english")
langmodel <- udpipe::udpipe_load_model(langmodel$file_model)
postagged_text <- udpipe_annotate(langmodel,
                                  master$text,
                                  parallel.cores = 15,
                                  trace = T)

postagged_text <- as.data.frame(postagged_text)
saveRDS(postagged_text, "postagged_text.rds")
postagged_text <- readRDS("postagged_text.rds")
master <- readRDS("master_cleaned.rds")

Stopword Removal

Stopwords are words that carry only grammatical meaning and provide little sentiment value on their own in a typical bag-of-words model. Therefore, they are removed to reduce the noise in the dataset and speed up computation.

In addition to standard stopword dictionaries such as the SMART lexicon, the NLTK dictionary and Fry's top 100 list, finance-specific dictionaries are used. Loughran and McDonald (LM) provide custom stopword lists for financial text on their website: https://sraf.nd.edu/textual-analysis/resources/#StopWords

These are used to filter out names of auditors who audit the reports, names of management, references to geographic locations as well as numbers.

#load downloaded dictionaries 
SW_Auditor <- data.frame(word = readLines("stopwords/StopWords_Auditor.txt"))
SW_Currencies <- read_delim("stopwords/StopWords_Currencies.txt", delim = "|")[1]
names(SW_Currencies) <- c("word")
SW_DatesNumbers <- data.frame(word = readLines("stopwords/StopWords_DatesandNumbers.txt"))
SW_Geographic <- data.frame(word = readLines("stopwords/StopWords_Geographic.txt"))
SW_Names <- data.frame(word = readLines("stopwords/StopWords_Names.txt"))

StopWords_LM <- rbind(SW_Auditor,SW_Currencies, SW_Names, SW_DatesNumbers, SW_Geographic)
#after POS tagging we take the lemma, meaning words will be in lower case
StopWords_LM <- StopWords_LM %>% mutate(word = tolower(str_replace_all(word,"[^[:graph:]]", " ")))

rm(SW_Auditor, SW_Currencies, SW_Names, SW_DatesNumbers, SW_Geographic)

#remove words unrelated to content
my_stopwords <-  data.frame(word = c("Table", "of", "Contents", "table", "contents"))
#remove company names
company_names <- portfolio %>% unnest_tokens(word, Security) %>% select(word)
my_stopwords <- rbind(my_stopwords, company_names)
rm(company_names)
#Load standard stopword dictionaries
stopwords_nltk<- as.data.frame(stopwords::data_stopwords_nltk$en) 
data(sw_fry_100)
stopwords_fry <- as.data.frame(sw_fry_100)
names(stopwords_fry) <- c("word")
names(my_stopwords) <- c("word")
names(stopwords_nltk) <- c("word")

Term Frequency Filtering

Rather than filtering by tf-idf, a cautious approach is exercised: all words which appear more than 5 times are kept. This deals with the vast majority of parsing mistakes. Method-specific tf-idf trimming is applied as needed later on.

# document term frequency filter
document_term_freq_filter <- function(tokens) {
  tokens <- tokens %>% 
    count(word) %>%
    filter(n > 5) %>% 
    inner_join(tokens)
  
  return(tokens)
}

Parsing Error Removal

#we use mistake detection to remove parsing errors
#we don't expect actual spelling mistakes in 10-K reports
#mistake detection is done after lemmatization and POS filtering
#this saves computing time as we have fewer tokens

#hunspell wrapper -- checks against US English first, then British English
hunspell_double_english <- function(word) {
  word <- unlist(hunspell::hunspell(word, dict = 'en_US'))
  if (is.character(word)) {
    word <- unlist(hunspell::hunspell(word, dict = 'en_GB'))
  }
  return (word)
}

Putting it all together

# function to monitor token count in pipe
count_tokens <- function(tokens, name = "") {
  tokens %>% 
    count(word) %>% 
    summarise(total = sum(n)) %$% 
    print(total)
  print(name)
  return (tokens)
}


tokens <- postagged_text %>% 
  filter(upos %in% c("NOUN","ADJ","ADV")) %>%
  select(lemma, doc_id) %>% 
  rename(word = lemma) %>% 
  mutate(word = tolower(word)) %>%
  count_tokens("Stage: Initial") %>% 
  anti_join(stop_words, by = "word") %>% 
  count_tokens("Stage 1: after SMART") %>% 
  anti_join(stopwords_nltk, by = "word") %>% 
  count_tokens("Stage 2: after NLTK") %>% 
  anti_join(my_stopwords, by = "word") %>% 
  count_tokens("Stage 3: after My Stopwords") %>% 
  anti_join(StopWords_LM, by = "word") %>% 
  count_tokens("Stage 4: after LM Stopwords") %>% 
  mutate(token_length=nchar(word)) %>% 
  arrange(token_length) %>% 
  filter(token_length > 3) %>% 
  filter(token_length < 17) %>%
  arrange(token_length) %>% 
  count_tokens(name="stage 5: Remove Token Length Filter") %>% 
  document_term_freq_filter() %>% 
  count_tokens(name="Stage 6: Term Frequency Filter")

lematized <- tokens %>% 
  group_by(doc_id) %>% 
  summarise(documents_pos_tagged = paste(word,collapse = " "))

# remove mistakes
mistakes <- parallel::mclapply(lematized$documents_pos_tagged, hunspell_double_english)
mistakes <- unique(mistakes)
mistakes <- data.frame(word = unlist(mistakes))

master <- master %>% 
  mutate(doc_id = paste0("doc",row_number()))

tokens <- tokens %>% 
  anti_join(mistakes) %>% 
  count_tokens(name="Stage 7: After Mistake Removal") %>% 
  inner_join(master %>% select(-text))

#saveRDS(mistakes, "mistakes.rds") #backup

#update lematized text
lematized <- tokens %>% 
  group_by(doc_id) %>% 
  summarise(documents_pos_tagged = paste(word,collapse = " ")) 

#add lematized text to main df
master <- master %>% 
  left_join(lematized)

rm(mistakes, lematized, langmodel, StopWords_LM, stopwords_nltk, stopwords_fry)

saveRDS(tokens, "tokens.rds")
saveRDS(master, "master_pos.rds")

TF-IDF Analysis

At this stage, tf-idf analysis is used as an exploratory tool to better understand which terms in 10-K reports are important in the different industries. Functions are defined which allow for dynamic tf-idf trimming and document term frequency filtering within groups. This enables exploration of keywords at different grouping levels with various levels of trimming. Three types of tokens are surveyed: unigrams, bigrams and trigrams.

The methodology used is as follows:

  • Examine words ranked by frequency
  • Examine words ranked by frequency with additional trimming
  • Examine words ranked by tf-idf
  • Examine words ranked by tf-idf with additional trimming
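
For reference, the weighting applied by tidytext's bind_tf_idf (which the helper functions below wrap) is the standard tf-idf form, with the grouping variable (industry or company) playing the role of the document:

\[ \text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\left( \frac{N}{n_t} \right) \]

where \( n_{t,d} \) is the count of term \( t \) in group \( d \), \( N \) is the number of groups and \( n_t \) is the number of groups in which \( t \) appears.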

#bind tf-idf by specified measure
bind_tf_idf_custom  <- function(tokens, by, within_group_freq_bound) {
  tokens <- tokens %>% 
    drop_na(.data[[by]]) %>% 
    drop_na(word) %>% 
    count(word, .data[[by]]) %>% 
    filter(n > within_group_freq_bound) %>% 
    bind_tf_idf(word, .data[[by]], n)  
}

#filter tokens by tfidf for specified quantiles
trim_by_tfidf <- function(tokens, quantiles) {
  if (!is.null(quantiles)) {
    quantiles <- tokens %$%
      quantile(tf_idf, probs = quantiles) %>% 
      tidy(quantiles, na.rm = F)
    tokens <- tokens %>% 
      filter(tf_idf > quantiles$x[1], tf_idf < quantiles$x[2]) 
  } 
  return(tokens)
}

#group and count based on measure specified
summarise_conditionally <- function(tokens, measure, group_by) {
  if (measure != "n") {
    tokens <- tokens %>% 
      group_by(.data[[group_by]]) %>% unique()
  } else {
    tokens <- tokens %>% 
      group_by(.data[[group_by]], word) %>%
      summarise(n= sum(n))
  }
  return(tokens)
}

#bind by category, group by same or other category
#filter by tf-idf or within group frequency
#plot top n tokens for group
filter_bind_plot <- function(df, 
                             tokens, 
                             id, 
                             bind_by, 
                             group_by,
                             measure, 
                             within_group_freq_bound = 0, 
                             quantiles = NULL, 
                             n = 10) {
  
  meta <- df %>% select(.data[[id]],.data[[group_by]], .data[[ bind_by]])  %>% unique(.) 
    tokens %>% 
      bind_tf_idf_custom(by=bind_by, within_group_freq_bound) %>% 
      trim_by_tfidf(quantiles) %>% 
      inner_join(meta) %>% 
      select(.data[[measure]], .data[[group_by]], word) %>% 
      summarise_conditionally(measure = measure, group_by = group_by) %>% 
      arrange(desc(.data[[measure]])) %>% 
      mutate(row_number = row_number()) %>% 
      filter(row_number %in% 1:15) %>% 
      facet_bar(y = word, x = .data[[measure]], by = .data[[group_by]], name = name)
  
}

#adapted from (Yan, 2020)
#utility function which reorders words based on give measure within groups
facet_bar <- function(df, y, x, by, nrow = 1, ncol = 3, scales = "free", name="") {
  mapping <- aes(y = reorder_within({{ y }}, {{ x }}, {{ by }}), 
                 x = {{ x }}, 
                 fill = {{ by }})
  
  facet <- facet_wrap(vars({{ by }}), 
                      nrow = nrow, 
                      ncol = ncol,
                      scales = scales) 
  
  ggplot(df, mapping = mapping) + 
    geom_col(show.legend = FALSE) + 
    scale_y_reordered() + 
    facet + 
    ylab("") + 
    theme_light()
} 

#dir to save images
if(!dir.exists("PartA")){
  dir.create("PartA")
}

Unigrams

GICS_Sub_Industry: Frequency

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry", 
                 measure = "n", n =15)
ggsave("PartA/figure1.png", width = 40, height = 15, units = "cm")

Top 15 terms per GICS_Sub_Industry ranked by frequency

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "cik", 
                 measure = "n", n =15)
ggsave("PartA/figure2.png", width = 40, height = 15, units = "cm")

Defining the document level at the company or the entire industry does not yield much variation in the most frequent terms. With little variation across industries, it appears that these tokens are used in any typical 10-K report. This makes sense, as companies are expected to discuss “revenue”, “cost” and “income”.

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry",
                 measure = "n", 
                 quantiles = c(0, 0.9999), n=15)
ggsave("PartA/figure3.png", width = 40, height = 15, units = "cm")

Trimming the bag of words model using tf-idf removes most of the overlapping frequent terms.

GICS_Sub_Industry: Tf-Idf

filter_bind_plot(master,  
                 tokens,
                 id="cik", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry", 
                 measure = "tf_idf", n=15)
ggsave("PartA/figure4.png", width = 40, height = 15, units = "cm")

Ranking words using tf-idf without trimming results in a selection of industry-specific terms. For example, looking at the semiconductor industry, terms such as “wafer”, “gigabit”, “chipset” and “foundry” are dominant. These all refer to the manufacturing or components of video cards.

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry", 
                 measure = "tf_idf", 
                 within_group_freq_bound = 100)
ggsave("PartA/figure5.png", width = 40, height = 15, units = "cm")

Applying a heavy within-group term frequency filter removes rare terms with high tf-idf, allowing more dominant terms to stand out. For example, the name “MasterCard” is successfully removed from the Data Processing industry.

Overall this final analysis presents a good illustration of important keywords across the 3 industries.

In the case of Application Software, “subscription” is the most dominant term.

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry", 
                 measure = "tf_idf", 
                 quantiles = c(0.01, 0.98))
ggsave("PartA/figure6.png", width = 40, height = 15, units = "cm")

Trimming using tf-idf rather than within-group frequency yields a messier set of terms. This is because infrequent terms can still persist amongst the different categories.

Bigrams

bigrams <- master %>% 
  unnest_tokens(word, documents_pos_tagged, token="ngrams", n=2) %>% 
  drop_na(word)

ggplot(head(bigrams %>% group_by(word) %>% count() %>% arrange(desc(n)),15), 
       aes(reorder(word,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Bigrams") + ylab("Frequency") +
  ggtitle("Most frequent bigrams")

ggsave("PartA/figure7.png", width = 40, height = 15, units = "cm")

filter_bind_plot(master,
           bigrams,
           id="doc_id", 
           group_by = "GICS_Sub_Industry", 
           bind_by = "GICS_Sub_Industry",
           measure = "n", quantiles = c(0.25, 0.98))
ggsave("PartA/figure8.png", width = 40, height = 15, units = "cm")

filter_bind_plot(master,
           bigrams,
           id="doc_id", 
           group_by = "GICS_Sub_Industry", 
           bind_by = "GICS_Sub_Industry",
           measure = "tf_idf", within_group_freq_bound = 50)
ggsave("PartA/figure9.png", width = 40, height = 15, units = "cm")

Analysis of bigrams is not very informative. Nonetheless, it does support some of the findings from the unigram analysis. For example, in semiconductor industry reports many parts of video cards are mentioned, whilst in Application Software subscriptions are commonly discussed.

Trigrams

For trigrams, the raw text is used in an attempt to get more meaningful results.

trigrams <- master %>% 
  unnest_tokens(word, text, token="ngrams", n=3) %>% 
  drop_na(word) 

ggplot(head(trigrams %>%
  group_by(word) %>% 
  count() %>% 
  arrange(desc(n)),15), aes(reorder(word,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Tigrams") + ylab("Frequency") +
  ggtitle("Most frequent tigrams")

ggsave("PartA/figure10.png", width = 40, height = 15, units = "cm")

filter_bind_plot(master,
                 trigrams,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry", 
                 bind_by = "GICS_Sub_Industry",
                 measure = "tf_idf")

“Air miles reward” suggests that credit card rewards are an important topic in the Data Processing & Outsourced Services sub-industry, which is not surprising given that Visa and Mastercard are part of this sector. Software license updates, hardware system support as well as subscriptions are dominant tokens in the Application Software sector. In the semiconductor industry the focus is on shipping and manufacturing. This insight will be useful when labelling topics in the topic modelling phase.


Part B

An event study is performed at this stage. Event studies have a long history going back to 1933. The main idea behind event studies is to compare the return of a stock during some window after an event to a baseline estimated from past data. This difference is known as the abnormal return, which can be attributed to the event. We largely follow the methodology of MacKinlay (1997).

Formally abnormal return is defined as:

\[ AR_{it} = R_{it} - E(R_{it}|X_t) \]

Abnormal return is the actual return minus the normal (expected) return for time window \( t \), where \( X_t \) represents the conditioning information on which the normal return is modeled.

The normal return needs to be modeled before the analysis begins. A simple approach would be to take the return on the stock a week before. However, this is not a sound approach, since anticipation of the filing already affects market prices. In practice two models are often used: the constant mean return model and the market model. The first assumes that a given security has a constant mean return across time. The second assumes that there is a linear relationship between the return on the security and the market (MacKinlay, 1997). The market model is an improvement over the constant mean model because it helps reduce the variance associated with market movements. There are more complicated models, such as the Fama-French 3-factor and 5-factor models, which help reduce the variance associated with different firm types. However, in our case we assume that there are indeed abnormal returns, at least in some cases, and we seek to understand whether these can be attributed to the sentiment of the management section of the report. So instead of including fundamental indicator information in the market model during the abnormal return estimation phase, it is used as a control during the regression-on-sentiment phase.
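
Concretely, under the market model the normal return is obtained by regressing the firm's return on the market return over the estimation window, and the fitted values serve as the expected return:

\[ R_{it} = \alpha_i + \beta_i R_{mt} + \varepsilon_{it}, \qquad E(R_{it}|X_t) = \hat{\alpha}_i + \hat{\beta}_i R_{mt} \]

In the implementation below, the market return \( R_{mt} \) is proxied by the average return of the other portfolio companies (excluding the target firm) rather than by a broad market index.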

After a model is chosen and the event window defined, abnormal and cumulative abnormal returns on a stock after filing are calculated by subtracting benchmark/normal returns from actual returns. Following MacKinlay (1997), the estimation window for normal returns is defined as the 250 trading days before the event window, which is roughly equivalent to one calendar year. The event window is defined as two weeks before filing to one week after filing. The two-week gap serves to ensure independence of normal returns from the event-driven returns. Abnormal returns are only calculated for the period after the filing.
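
Abnormal returns over the event window are then aggregated by summation; because log returns are used, this sum is also the (log) abnormal return over the whole window:

\[ CAR_i(t_1, t_2) = \sum_{t=t_1}^{t_2} AR_{it} \]

This corresponds to the cumulative abnormal return variables computed further below.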

Calendar non-trading day adjustments

Before prices are fetched to calculate returns, dates need to be formatted and adjusted. This is done because the stock market is closed on weekends and holidays. This means that if we simply take n days before or after filing to calculate some financial indicator, we will get NA values if this date falls on a holiday. To prevent this we need to offset days based on the business day calendar. Hence we declare holidays using the bizdays package and offset using its API. This approach is in line with how returns are normally calculated in the financial sector.

Holidays on which the US stock exchange is closed in 2021:

  • New Year’s Day: Friday, Jan. 1
  • Martin Luther King Jr. Day: Monday, Jan. 18
  • Washington’s Birthday/Presidents Day: Monday, Feb. 15
  • Good Friday: Friday, April 2
  • Memorial Day: Monday, May 31
  • Independence Day: Monday, July 5 (observed, because July 4 falls on a Sunday)
  • Labor Day: Monday, Sept. 6
  • Thanksgiving: Thursday, Nov. 25
  • Christmas: Friday, Dec. 24 (observed, because Christmas Day falls on a Saturday)

Note that many holidays are shifted when they fall on weekends, whilst others depend on the week number. Other holidays, like "Inauguration Day", occur every 4 years. This means that formulating a rules-based approach is tedious and error-prone. Instead, a publicly available API is used to get the dates of all US federal holidays. However, Good Friday is not a national holiday but a state one; therefore dates for this holiday are scraped from another website. A number of dates are added manually: the stock market closed during Hurricane Sandy and to commemorate George Bush's death.

domain <- "https://date.nager.at/api/v3/PublicHolidays"
years <- 2008:2021

holidays <- c()
for (year in years) {
  url <- url(combURL(domain, c(year, "US")))
  json <- jsonlite::stream_in(url)
  holidays <- c(holidays, json$date)
  on.exit(close(url))
}


#get page with religious public holidays
public_holidays <- read_html("http://www.maa.clell.de/StarDate/publ_holidays.html")

tbl1 <- public_holidays %>% html_nodes("table") %>% 
  .[[6]] %>% 
  html_table() %>% 
  as.data.frame() %>% 
  filter(X1 %in% years) %>% 
  mutate(X3 = str_replace_all(X3, "\\.", "-")) %>% 
  mutate(date = ymd(ydm(paste(X1, X3, sep="-")))) %>% 
  pull(date)

tbl2 <- public_holidays %>% html_nodes("table") %>% 
  .[[7]] %>% 
  html_table() %>% 
  as.data.frame() %>% 
  filter(X1 %in% years) %>% 
  mutate(X3 = str_replace_all(X3, "\\.", "-")) %>% 
  mutate(date = ymd(ydm(paste(X1, X3, sep="-")))) %>% 
  pull(date)

good_friday_dates <- c(tbl1, tbl2)

# add dates manually
dates <- c("2018-12-05", # george bushes death 
           "2012-10-29", # Hurricane Sandy
           "2012-10-30") # Hurricane Sandy

holidays <- c(holidays, as.character(good_friday_dates), dates)


# declare holidays and weekends
create.calendar(name="mycal", 
                weekdays=c('saturday', 'sunday'),
                holidays=holidays)


rm(good_friday_dates, dates, tbl2, tbl1, year, json, public_holidays, url, domain, years)
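
As a quick sanity check of the calendar, the illustrative calls below (not part of the main pipeline) show how bizdays skips both weekends and the declared holidays; the exact results assume that New Year's Day 2021 is present in the fetched holiday list.

# illustrative checks of the "mycal" trading calendar declared above
bizdays::is.bizday(as.Date("2021-01-01"), "mycal") # FALSE: declared holiday
bizdays::is.bizday(as.Date("2021-01-02"), "mycal") # FALSE: Saturday
bizdays::offset(as.Date("2020-12-31"), 1, "mycal") # next trading day, skipping the holiday and weekend
bizdays::bizdays(as.Date("2021-01-01"), as.Date("2021-01-31"), "mycal") # trading days in January 2021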

Fundamental Indicators

master <- master %>% 
  mutate(year = format(date, '%Y'))
master$year <- as.numeric(master$year)
domain <- "https://www.macrotrends.net/stocks/charts"
endpoints <- c("pe-ratio", "shares-outstanding", "eps-earnings-per-share-diluted", 
               "debt-equity-ratio", "roe", "roi", "roa")

colums_needed <- c("PE Ratio", "Debt to Equity Ratio", "Return on Equity", 
                   "Return on Investment", "Return on Assets")


fundamental_indicators <- data.frame()

for (i in 1:nrow(portfolio)) {
  company_fundamentals <- data.frame()
  company_name <- portfolio [i,]$Security
  company_symbol <- portfolio [i,]$Symbol
  company_name <- str_replace_all(company_name, ' ', "-")
  for (endopoint in endpoints) {
    url <- combURL(domain, c(company_symbol, company_name, endopoint))
    print(url)
    html <- read_html(url)
    tbl <- html %>% html_nodes("table") %>% .[[1]] %>% html_table() %>% as.data.frame() 
    if (!(endopoint %in% c("shares-outstanding", "eps-earnings-per-share-diluted"))) {
      names(tbl) <- as.matrix(tbl[1, ])
      tbl <- tbl[-1, ]
      tbl[] <- lapply(tbl, function(tbl) type.convert(as.character(tbl)))
    }
    temp_df <- data.frame(date=as.character(tbl[,1]), 
                          value=tbl[,names(tbl) %in% colums_needed])
    if (ncol(temp_df) == 1) {
      temp_df <- data.frame(date=parse_number(as.character(tbl[,1])), 
                            value=parse_number(as.character(tbl[,2])))
    } else {
      temp_df <- data.frame(date=parse_number(as.character(temp_df[,1])), 
                            value=parse_number(as.character(temp_df[,2])))
    }
    names(temp_df) <- c("date", endopoint)
    if (!(endopoint %in% c("shares-outstanding", "eps-earnings-per-share-diluted"))) {
    temp_df <- temp_df %>% 
      mutate_if(is.character, ~ year(ymd(date))) 
    }
    temp_df <- temp_df %>%  filter(date %in% 2008:2020) 
    temp_df <- aggregate(temp_df[,2], list(temp_df$date), mean)
    
    names(temp_df) <- c("year", endopoint)
    if (length(company_fundamentals) == 0) {
      company_fundamentals <- temp_df
    } else {
      company_fundamentals <- temp_df %>% inner_join(company_fundamentals)
    }
    company_fundamentals$Symbol <- company_symbol
    
  }
  fundamental_indicators <- rbind(fundamental_indicators, company_fundamentals)
}  

names(fundamental_indicators) <- str_replace_all(names(fundamental_indicators), "-", "_")

rm(company_fundamentals, company_name, company_symbol, endopoint, 
   endpoints, domain, temp_df, url, colums_needed, tbl, html)

colSums(!is.na(fundamental_indicators[,2:ncol(fundamental_indicators )]))

#roi column has lots of missing values unlike all other columns
#given the small size of our data set this indicator is dropped
fundamental_indicators <- fundamental_indicators %>% 
  select(-roi)

saveRDS(fundamental_indicators, "fundamental_indicators.rds")

Indicator EDA

master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=roa)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4)

The semiconductor industry has a higher ROA, indicating higher profitability per unit of assets deployed.

master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=roe)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4) +  
  xlim(-50, 100)
## Warning: Removed 22 rows containing non-finite values (stat_density).

master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=pe_ratio)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4) + 
  xlim(0, 100)
## Warning: Removed 18 rows containing non-finite values (stat_density).

Application Software has a more spread out distribution of PE ratios. The semiconductor industry is the most conservative in terms of price to earnings.


master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=debt_equity_ratio)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4) + 
  xlim(0, 15)
## Warning: Removed 17 rows containing non-finite values (stat_density).

The semiconductor industry has the lowest levels of leverage relative to equity. Data Processing and Outsourced Services has the highest amount of debt.

master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=eps_earnings_per_share_diluted)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4)

Earnings per share appear to be largely the same across the industries examined.

Financial Indicator Fetching and Calculation

Log returns are used in this analysis. Using log prices has a number of advantages, including ease of arithmetic manipulation. In most cases, with the exception of some technical indicators, the adjusted closing price is used since it takes into account corporate actions such as stock splits.

Note that some filings take place even when the stock market is closed, namely during Hurricane Sandy. The original date must therefore also be offset in this rare case, which makes the script more future-proof.

  EXAMPLE <- master[1,]

  comp <- getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"), auto.assign=FALSE)
  
  daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")
  
  chartSeries(comp,
              subset=daterange,
              theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")
EXAMPLE <- master[1,]

getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"))

daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")

chartSeries(get(EXAMPLE$Symbol),
            subset=daterange,
            theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')") 
EXAMPLE <- master[2,]

getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"))

daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")

no_axis <- x <- chartSeries(ANSS,
            subset=daterange,
            theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")
EXAMPLE <- master[13,]

getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"))

daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")

chartSeries(ANSS,
            subset=daterange,
            theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")

Note, prices for 2009 are also extracted as reports filed in 2010 have an estimation period outside the bounds of the specified date range.

tickers <- master %>% pull(Symbol) %>% unique()
#list to store all prices
prices <- lapply(tickers, getSymbols, auto.assign=FALSE, from='2009-01-01',to='2020-12-31')
names(prices) <- tickers

Returns

Daily log returns are calculated. Log returns can be summed to get the cumulative return for a given period.
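
For reference, the daily log return produced by dailyReturn(type = "log") is

\[ r_t = \ln\left(\frac{P_t}{P_{t-1}}\right) = \ln P_t - \ln P_{t-1}, \qquad \sum_{t=1}^{k} r_t = \ln\left(\frac{P_k}{P_0}\right) \]

so summing daily log returns over a window yields the log return for the whole window, which is why cumulative abnormal returns can be computed as simple sums.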

my_return <- function(x) {
    y <- dailyReturn(x, type="log")
    names(y) <- strsplit(names(x)[1], "\\.")[[1]][1]
    y
}

returns <- lapply(prices, my_return)
names(returns) <- tickers

Volume

Volume is the total number of shares of a security traded in a given time period. However, companies have different numbers of outstanding shares, so the daily volume data first needs to be normalized to enable comparison between companies. Furthermore, similar to abnormal returns, the log of volume per share is used. This is done because volume is not normally distributed, breaking statistical and financial assumptions (Yadav, 1992). The log transform solves this issue, but adding a small constant is required to prevent NA values at zero volume (Yadav, 1992). Additional information about the formula used can be found at: https://www.eventstudytools.com/volume-event-study
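
In formula form, the normalisation implemented in the dailyVolume() helper below is

\[ v_{it} = \ln\left( \frac{V_{it} + c}{S_{i,y}} \times 1000 \right), \qquad c = 0.00025, \]

where \( V_{it} \) is the raw daily volume, \( S_{i,y} \) is the shares outstanding of firm \( i \) in year \( y \), and \( c \) is a small constant that prevents taking the log of zero volume.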

dailyVolume <- function(v, y, df) {
  shares <- df %>% filter(year == y) %>%  pull(shares_outstanding)
  vol <- log(((v + 0.00025)/shares*1000))
  vol
}

my_volume <- function(i) {
    x <- prices[[i]]
    name <- names(prices)[i]
    df <- fundamental_indicators %>% filter(Symbol == name)
    v=Vo(x)
    v$vol <- mapply(dailyVolume, v=v, year(index(x)), list(df))
    return(v[,2])
}

volume <- lapply(seq_along(prices), my_volume)
names(volume) <- tickers

Calculating

# utility function to help subset list of returns calculated earlier
subset_comp <- function(x, start_date, end_date) {
  x <- x[as.character(paste(start_date, end_date, sep = "/"))]
  as.data.frame(x)
}

# get avg market return for specified date range
# exclude target symbol
get_market_return <- function(symbol, estimation_start_date, estimation_end_date) {
  avg_market_daily_returns <- mapply(FUN = subset_comp, 
                                     x = returns[tickers[tickers != symbol]], 
                                     start_date = estimation_start_date, 
                                     end_date = estimation_end_date)
  avg_market_daily_returns  <- as.data.frame(
    matrix(unlist(avg_market_daily_returns), 
           nrow=length(unlist(avg_market_daily_returns[1]))))
  avg_market_daily_returns  <- rowMeans(avg_market_daily_returns, na.rm=T)
  return(avg_market_daily_returns)
}


# get avg market volume for specified date range
# exclude target symbol
get_market_volume <- function(symbol, estimation_start_date, estimation_end_date) {
  avg_market_daily_volume <- mapply(FUN = subset_comp, 
                                    x = volume[tickers[tickers != symbol]], 
                                    start_date = estimation_start_date, 
                                    end_date = estimation_end_date)
  avg_market_daily_volume  <- as.data.frame(
    matrix(unlist(avg_market_daily_volume), 
           nrow=length(unlist(avg_market_daily_volume[1]))))
  avg_market_daily_volume  <- rowMeans(avg_market_daily_volume)
  print(avg_market_daily_volume)
  return(avg_market_daily_volume)
}

# fit market model
get_market_model <- function(avg_market_daily_returns_in_est, comp_returns_in_est) {
  print(avg_market_daily_returns_in_est)
  print(comp_returns_in_est)
  x <- cbind(comp_returns_in_est, avg_market_daily_returns_in_est)
  x <- as.data.frame(x)
  names(x) <- c("company_returns", "market_return")
  market_model <- lm(company_returns ~ market_return, data=x)
  return(market_model) 
}

# predict using market model
calculate_normal_returns <- function(market_model, comp_returns, avg_market_daily_returns) {
  x <- cbind(comp_returns, avg_market_daily_returns)
  x <- as.data.frame(x)
  names(x) <- c("company_returns", "market_return")
  normal_returns <- predict(market_model, x)
}

#
# Main function to be used in a mapping operation.
#


returns_calc <- function(symbol, date) {
  #offset required dates
  date <- if_else(date %in% as.Date(holidays), bizdays::offset(date, 1, "mycal"), date)
  day_before <- bizdays::offset(date, -1, "mycal")
  next_day <- bizdays::offset(date, 1, "mycal")
  next_week <- bizdays::offset(day_before, 5, "mycal")
  next_month <- bizdays::offset(day_before, 21, "mycal")
  next_year <- bizdays::offset(day_before, 250, "mycal")
  prior_2_weeks <- bizdays::offset(day_before, -14, "mycal")
  estimation_start_date <- bizdays::offset(prior_2_weeks, -250, "mycal")
  
  # VOLUME
  
  # next day
  
  # market volume during estimation period
  avg_market_daily_volume <- get_market_volume(symbol, estimation_start_date, prior_2_weeks)
  # target company volume during estimation period
  interval <- as.character(paste(estimation_start_date, prior_2_weeks, sep = "/"))
  comp_volume_in_est <-  as.numeric(volume[[symbol]][interval, ])
  # market model
  market_model <- get_market_model(avg_market_daily_volume, comp_volume_in_est)
  
  # calculating normal volume after filing
  avg_market_daily_volume <- get_market_volume(symbol, date, next_day)
  interval <- as.character(paste(date, next_day, sep = "/"))
  comp_volume <-  as.numeric(volume[[symbol]][interval, ])
  normal_volume <- calculate_normal_returns(market_model, comp_volume, avg_market_daily_volume)
  
  # calculating abnormal volume after filing
  abnormal_volume <- comp_volume - normal_volume
  avg_abnormal_volume_next_day <- mean(abnormal_volume)
  cum_abnormal_volume_next_day <- sum(abnormal_volume)
  volume_direction_next_day <- ifelse(cum_abnormal_volume_next_day > 0, 1, 0)
  print("cum abnormal vol")
  print(cum_abnormal_volume_next_day)
  
  # RETURNS
  
  # market returns during estimation period
  avg_market_daily_returns_in_est <- get_market_return(symbol, estimation_start_date, prior_2_weeks)
  # target company return during estimation period
  interval <- as.character(paste(estimation_start_date, prior_2_weeks, sep = "/"))
  comp_returns_in_est <-  as.numeric(returns[[symbol]][interval, ])
  # market model
  market_model <- get_market_model(avg_market_daily_returns_in_est, comp_returns_in_est)
  
  #AFTER FILING
  
  # next day
  
  # calculating normal returns after filing
  avg_market_daily_returns <- get_market_return(symbol, date, next_day)
  interval <- as.character(paste(date, next_day, sep = "/"))
  comp_returns <-  as.numeric(returns[[symbol]][interval, ])
  normal_returns <- calculate_normal_returns(market_model, comp_returns, avg_market_daily_returns)
  
  # calculating abnormal returns after filing
  abnormal_returns <- comp_returns - normal_returns
  avg_abnormal_return_next_day <- mean(abnormal_returns)
  cum_abnormal_return_next_day <- sum(abnormal_returns)
  direction_next_day <- ifelse(cum_abnormal_return_next_day > 0, 1, 0)
  
  # week
  
  # calculating normal returns after filing
  avg_market_daily_returns <- get_market_return(symbol, date, next_week)
  interval <- as.character(paste(date, next_week, sep = "/"))
  comp_returns <-  as.numeric(returns[[symbol]][interval, ])
  normal_returns <- calculate_normal_returns(market_model, comp_returns, avg_market_daily_returns)
  
  # calculating abnormal returns after filing
  abnormal_returns <- comp_returns - normal_returns
  avg_abnormal_return <- mean(abnormal_returns)
  cum_abnormal_return <- sum(abnormal_returns)
  variance_next_week <- var(abnormal_returns)


  #BEFORE FILLING
  
  before <- prices[[symbol]][as.character(paste(prior_2_weeks, day_before, sep = "/")), ]
  # TA indicators
  # roc - momentum indicator
  roc <- as.numeric(ROC(Ad(before),n = 7)[day_before])
  # sma - simple moving average
  ma7 <- as.numeric(SMA(Ad(before), 7)[day_before]) 
   # rsi - momentum indicator - strength of current movement direction
  rsi <- as.numeric(RSI(Ad(before), 7)[day_before]) 
  # obv - on-balance volume, a measure of the money flowing into or out of a security
  obv <- as.numeric(OBV(Ad(before), Vo(before))[day_before])
  df <- cbind(cum_abnormal_return, avg_abnormal_return, 
              avg_abnormal_return_next_day, cum_abnormal_return_next_day,
              avg_abnormal_volume_next_day,cum_abnormal_volume_next_day, volume_direction_next_day,
              roc, ma7, rsi, obv, variance_next_week)
  names(df) <- c("cum_abnormal_return_next_week", "avg_abnormal_return_next_week", 
                 "avg_abnormal_return_next_day", "cum_abnormal_return_next_day",
                 "avg_abnormal_volume_next_day","cum_abnormal_volume_next_day", "volume_direction_next_day",
                  "roc", "ma7", "rsi", "obv", "variance_next_week")
  df
}


indicators <- master  %>% select(doc_id, Symbol, date)
x <-  mcmapply(returns_calc, symbol=indicators$Symbol, date=indicators$date)
indicators <- data.frame(cbind(as.data.frame(t(x)), indicators), row.names = 1:nrow(indicators))

rm(prices, returns, volume, tickers, end.time, start.time, 
   my_return, my_volume, returns_calc, subset_comp, i)

saveRDS(indicators, "indicators.rds")

Indicator EDA

Some of the chosen indicators are highly correlated. Therefore, one or two need to be removed to prevent multicollinearity problems.

corr <- round(cor(fundamental_indicators %>% 
                    select(roa, roe, debt_equity_ratio, 
                           eps_earnings_per_share_diluted, pe_ratio), 
                  method="spearman"), 1)
ggcorrplot(corr)

ROE and ROA are correlated; one of them should be removed before regression.

hist(indicators$avg_abnormal_return_next_day, breaks = 40)

hist(indicators$cum_abnormal_return_next_day, breaks = 40)

Examining the distribution of average abnormal returns shows that in most cases there are no abnormal returns. Therefore, there is an opportunity to create a new feature, movement direction. Loss is defined as everything below the 25th percentile of abnormal returns, whilst gain is defined as everything above the 75th percentile. Everything in between is defined as stay. This feature engineering should reduce the noise in our target variable.

indicators <- indicators %>% 
  mutate(movment_direction = 
           ifelse(cum_abnormal_return_next_day < quantile(indicators$cum_abnormal_return_next_day)[2], 
                  "loss", "stay")) %>% 
  mutate(movment_direction = 
           ifelse(cum_abnormal_return_next_day > quantile(indicators$cum_abnormal_return_next_day)[4],
                  "gain", movment_direction)) %>% 
  mutate(movment_direction = factor(movment_direction))

Sentiment Calculation

Tidytext is used to get the AFINN, NRC, and Bing dictionaries. Alternatively, the syuzhet package could be used; however, the tidytext dictionaries provide more flexibility in terms of token manipulation and sentiment calculation. The SentimentAnalysis package is used to compute polarity with the Loughran and McDonald, Harvard General Inquirer and Henry dictionaries, as well as the LM uncertainty ratio. The Loughran and McDonald and Henry dictionaries are finance-specific, so they are expected to produce more accurate scores. Furthermore, the SentimentAnalysis package is used to fit a custom sentiment dictionary by setting cumulative returns as the response variable. Finally, the sentimentr package is used to score sentiment whilst taking negation and amplification into account, using both its default dictionary and a custom one based on the Loughran and McDonald dictionary from tidytext. The sentimentr function is a faster and more accurate alternative to the qdap polarity function created by the same author.

#nrc
#grab sum of emotional words to normalize emotions in next step
total <- tokens %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(!(sentiment %in% c("positive", "negative"))) %>% 
  group_by(doc_id) %>% 
  count(doc_id, sentiment) %>% 
  summarize(total = sum(n)) %>% 
  pull(total)

#polarity + emotions
sent <- tokens %>%
  inner_join(get_sentiments("nrc")) %>% 
  count(doc_id,sentiment) %>%
  pivot_wider(names_from=sentiment, values_from=n) %>%
  mutate(sentiment_nrc = (positive - negative)/(positive+negative)) %>%
  select(-negative, -positive) %>% 
  mutate(across(c(2:9), .fns = ~./total)) 

#bing
sent <- tokens %>% 
  inner_join(get_sentiments("bing")) %>%
  count(doc_id,sentiment) %>%
  pivot_wider(names_from = sentiment,values_from = n) %>%
  mutate(sentiment_bing = (positive-negative)/(positive+negative)) %>%
  select(-negative, -positive)  %>% 
  inner_join(sent)

#afinn
sent <- tokens %>% 
  inner_join(get_sentiments("afinn")) %>%
  group_by(doc_id) %>% 
  summarise(sentiment_afinn = sum(value))   %>% 
  inner_join(sent)

#sentimentr package

#default dictionary

text_with_punctuation <- readRDS("text_with_punctuation.rds")
text_with_punctuation  <- sentimentr::get_sentences(text_with_punctuation)
sentimentr <- sentimentr::sentiment_by(text_with_punctuation)
sent$sentimentr  <- sentimentr$ave_sentiment

#LM dictionary
lm <- tidytext::get_sentiments("loughran")
lm_key <- data.frame(
    words = lm$word,
    polarity = ifelse(lm$sentiment == "positive", 1, -1),
    stringsAsFactors = T
)

lm_key <- sentimentr::as_key(lm_key)
sentimentr_lm <- sentimentr::sentiment_by(text_with_punctuation, polarity_dt = lm_key)
sent$sentimentr_lm <- sentimentr_lm$ave_sentiment

# LM, GI, HE and QDAP sentiment dictionaries from the SentimentAnalysis package
sentiment <- SentimentAnalysis::analyzeSentiment(master$documents_pos_tagged, 
                                                 stemming=FALSE, removeStopwords=FALSE)
sentiment$WordCount <- NULL
sent <- cbind(sent,sentiment)

#SentimentAnalysis package also provides the ability to create custom dictionary 
#this is done by aligning words to response variable

cust_dict <- master %>% 
  inner_join(indicators) %>% 
  drop_na(cum_abnormal_return_next_day) %$% 
  generateDictionary(documents_pos_tagged,cum_abnormal_return_next_day, 
                     modelType="ridge", family="binomial")

cust_dict$intercept <- NULL
cust_dict <- as.data.frame(matrix(unlist(cust_dict), nrow=length(unlist(cust_dict[1]))))
names(cust_dict) <- c("word", "sentiment", "idf")
cust_dict$sentiment <- as.numeric(cust_dict$sentiment)

sent <- tokens %>% 
  inner_join(cust_dict) %>%
  group_by(doc_id)  %>% 
  summarise(sentiment_custom = sum(sentiment)) %>% 
  inner_join(sent, by="doc_id")

# calculate and add sentiment change columns to the sent data frame
ids <- master %>% select(cik, date, doc_id)

sent_diff <- function(sentiment) {
  sentiment_change <- sentiment - lag(sentiment)
}

sent <- sent %>% 
  inner_join(ids) %>% 
  mutate(cik = factor(cik)) %>% 
  group_by(cik) %>% 
  arrange(date)  %>% 
  mutate(across(where(is.numeric), list(change = ~ sent_diff(.))))

saveRDS(sent, "sent.rds")

rm(cust_dict, sentiment, lm, ids, lm_key, sent_diff, 
   text_with_punctuation, sentimentr_lm, tokens)
meta <- master %>% select(doc_id, GICS_Sub_Industry)
data_for_regression <- sent %>% 
  inner_join(indicators, by = c("doc_id", "date")) %>% 
  inner_join(meta, by = "doc_id") %>% 
  mutate(year = year(date)) %>% 
  left_join(fundamental_indicators, by=c("Symbol", "year")) %>% 
  mutate(cik = factor(cik)) %>% 
  mutate(GICS_Sub_Industry = factor(GICS_Sub_Industry))

Analysis

In this dataset multiple entities are observed over time; in other words, the data is panel data. Simple cross-sectional analysis does not account for unobserved heterogeneity among companies: firms may have underlying fixed factors that are not captured by our model, anything from brand value to employee satisfaction. To build a robust model this must be controlled for, which can be done through the plm package or manually. Specifying the model manually with lm enables further analysis with cross-validation, which is not as straightforward with plm. Cross-validation allows us to check whether the model is overfitting and to gauge its general predictive power.
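As a minimal sketch of this equivalence (illustrative only; plm is an additional dependency not in the library list above, and the variable names follow data_for_regression defined earlier), the dummy-variable approach and the within estimator give the same slope estimate:

# firm fixed effects two ways; the slope on the sentiment variable agrees
library(plm) # assumption: not part of the libraries loaded above

# 1) least-squares dummy variables, the approach used in this report
m_lsdv <- lm(cum_abnormal_return_next_day ~ SentimentLM + factor(cik),
             data = data_for_regression)

# 2) within (fixed effects) estimator from plm
m_within <- plm(cum_abnormal_return_next_day ~ SentimentLM,
                data = data_for_regression, index = "cik", model = "within")

coef(m_lsdv)["SentimentLM"]
coef(m_within)["SentimentLM"]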

# function to store cross validation results in tidy format
store_cv <- function(cv, y, x, to) {
  cv$y <- y
  cv$x <- x
  temp <- data.frame(model = unlist(cv))
  temp$measure <- rownames(temp)
  temp <- pivot_wider(temp,names_from = measure, values_from = model)
  return(rbind(to, temp))
}

The function regression2DwithCV takes a list of dependent variables, independent variables, and control variables. A regression is fitted for each pair of dependent and independent variables whilst controlling for the specified variables. 10-fold cross-validation is performed and the results are stored using the function defined above. Stargazer is used to display the regression and cross-validation results.

cv_resutls <- data.frame()

regression2DwithCV <- function(dv_names, iv_names, controls, name) {
  
  sentiment_models <- list()
  
  for (y in dv_names){
    for (x in iv_names) {
      x <- paste(x, controls, sep = "+")
      form <- formula(paste(y, "~", x))
      model <- lm(form, data=data_for_regression,  x = TRUE, y = TRUE) 
      sentiment_models[[y]][[x]] <- model
      cv_model <- cv.lm(model, k = 10, seed = 123, max_cores = detectCores() - 1)
      cv_resutls <<- store_cv(cv_model, y, x, cv_resutls)
    }
  }
  
  for (y in sentiment_models) {
    #link <- combURL("PartB", paste0(all.vars(formula(y[[1]]))[1], "_", name, ".html"))
    #print(name)
    #stargazer::stargazer(y, type = "html", omit="cik", out = link)
    #print(tab_model(y, collapse.ci = TRUE, collapse.se = TRUE))
  }
}
control <- c("GICS_Sub_Industry")
control <- paste(control, collapse = '+')
dv_names <- c("cum_abnormal_return_next_week", "avg_abnormal_return_next_week", 
                 "avg_abnormal_return_next_day", "cum_abnormal_return_next_day",
                 "avg_abnormal_volume_next_day","cum_abnormal_volume_next_day", 
              "volume_direction_next_day", "variance_next_week")

iv_names <- c("roe", "roa", "debt_equity_ratio", 
              "eps_earnings_per_share_diluted", 
              "pe_ratio", "roc", "ma7", "rsi", "obv")

control <- c("GICS_Sub_Industry")
control <- paste(control, collapse = '+')

regression2DwithCV(dv_names,iv_names, control, name="base_model-1")

iv_names <- c("roe + ma7", "roa + ma7", "debt_equity_ratio + ma7", 
              "eps_earnings_per_share_diluted + ma7", 
              "pe_ratio + ma7", "roc + ma7", "rsi + ma7", "obv + ma7")

regression2DwithCV(dv_names,iv_names, control, name="base_model-2")


control <- c("cik", "GICS_Sub_Industry", "debt_equity_ratio", "ma7")
control <- paste(control, collapse = '+')

iv_names <- c("sentiment_nrc", 
             "sentiment_bing", 
             "sentiment_afinn", 
             "SentimentGI", 
             "SentimentHE", 
             "SentimentLM", 
             "SentimentQDAP", 
             "RatioUncertaintyLM", 
             "sentimentr_lm", 
             "sentimentr", 
             "sentiment_custom")

regression2DwithCV(dv_names,iv_names, control, name="just_sentiment")

iv_names <- c("sentiment_nrc_change", 
             "sentiment_bing_change", 
             "sentiment_afinn_change", 
             "SentimentGI_change", 
             "SentimentHE_change", 
             "SentimentLM_change", 
             "SentimentQDAP_change", 
             "RatioUncertaintyLM_change", 
             "sentimentr_lm_change", 
             "sentimentr_change", 
             "sentiment_custom_change")

regression2DwithCV(dv_names,iv_names, control, name="sentiment_change")


iv_names <- c("NegativityGI", 
             "PositivityGI", 
             "NegativityHE", 
             "PositivityHE", 
             "PositivityLM", 
             "NegativityLM", 
             "NegativityQDAP", 
             "PositivityQDAP")

regression2DwithCV(dv_names,iv_names, control, name="neg_pos")

iv_names <- c("NegativityGI_change", 
             "PositivityGI_change", 
             "NegativityHE_change", 
             "PositivityHE_change", 
             "PositivityLM_change", 
             "NegativityLM_change", 
             "NegativityQDAP_change", 
             "PositivityQDAP_change")

regression2DwithCV(dv_names,iv_names, control, name="neg_pos_change")

iv_names <- c("anger", 
             "anticipation", 
             "disgust", 
             "fear", 
             "joy", 
             "sadness", 
             "surprise")

regression2DwithCV(dv_names,iv_names, control, name="emotion")
regression2DwithCV(dv_names, paste(iv_names, collapse = '+' ), control)


iv_names <- c("anger_change", 
             "anticipation_change", 
             "disgust_change", 
             "fear_change", 
             "joy_change", 
             "sadness_change", 
             "surprise_change")

regression2DwithCV(dv_names,iv_names, control, name="emotion_change")
regression2DwithCV(dv_names, paste(iv_names, collapse = '+' ), control)

iv_names <- c("SentimentLM", 
             "NegativityLM", 
             "disgust", 
             "disgust_change", 
             "sadness_change", 
             "sentiment_nrc")
regression2DwithCV(dv_names, paste(iv_names, collapse = '+' ), control, name="combined")

saveRDS(cv_resutls, "PartB/cv_results.rds")
cv_results <- readRDS("PartB/cv_results.rds")

cvs <- cv_results %>% 
    group_by(y) %>% 
    arrange(MAE.mean) %>% 
    top_n(5) %>% 
    select(MAE.mean, MAE.sd, y, x) %>% 
    group_split() 
## Selecting by x
#%>% 
    #knitr::kable() %>% 
    #kable_styling(position = "center")

cvs <- lapply(cvs, as.data.frame)

#stargazer::stargazer(cvs, summary = rep(F,length(cvs)), type = "text",no.space=TRUE)

for (table in cvs) {
  print(knitr::kable(table))
}
MAE.mean MAE.sd y x
0.0091782386533657 0.00150966428059187 avg_abnormal_return_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00918275577718206 0.0014864845553651 avg_abnormal_return_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00937887457994281 0.00157252778146152 avg_abnormal_return_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00942693109529079 0.00163577620450724 avg_abnormal_return_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00944016648277091 0.00157738589935523 avg_abnormal_return_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.00488252030219267 0.000792097772451372 avg_abnormal_return_next_week sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00488480213504792 0.000827350956256574 avg_abnormal_return_next_week sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0048971316189137 0.000803705505237141 avg_abnormal_return_next_week surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00506523396665765 0.00101059136594638 avg_abnormal_return_next_week sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00511027038066066 0.000973603470904424 avg_abnormal_return_next_week surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.285823179466672 0.0332135958018523 avg_abnormal_volume_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.286055675928028 0.0305144248874301 avg_abnormal_volume_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.286720532002201 0.0303513796748044 avg_abnormal_volume_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.28773818023958 0.02961373174355 avg_abnormal_volume_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.289037804342597 0.034031508505347 avg_abnormal_volume_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.0183564773067314 0.00301932856118374 cum_abnormal_return_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0183655115543641 0.00297296911073021 cum_abnormal_return_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0187577491598856 0.00314505556292304 cum_abnormal_return_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0188538621905816 0.00327155240901447 cum_abnormal_return_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0188803329655418 0.00315477179871046 cum_abnormal_return_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.0246264358413286 0.00396814378728267 cum_abnormal_return_next_week sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0246437007683787 0.00415640509804959 cum_abnormal_return_next_week sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.024726079602924 0.0040109198300447 cum_abnormal_return_next_week surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0255328776878948 0.0051408946575127 cum_abnormal_return_next_week sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0257899431787833 0.00492203229343267 cum_abnormal_return_next_week surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.571646358933344 0.0664271916037046 cum_abnormal_volume_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.572111351856057 0.0610288497748603 cum_abnormal_volume_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.573441064004402 0.0607027593496088 cum_abnormal_volume_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.575476360479161 0.0592274634870999 cum_abnormal_volume_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.578075608685193 0.0680630170106941 cum_abnormal_volume_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.000230415809367263 6.85018531782221e-05 variance_next_week sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.000230964051486379 6.90107158300945e-05 variance_next_week sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.000231354207990225 6.22479921024627e-05 variance_next_week surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.000232941853469224 4.48720128309396e-05 variance_next_week surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.000236066889363565 3.90054950919969e-05 variance_next_week sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.409198809719714 0.0395613828604193 volume_direction_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.409492450027264 0.0423304681838998 volume_direction_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.420026866804643 0.0460438835979588 volume_direction_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.420072765324612 0.0464012455806698 volume_direction_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.420807588254073 0.0455501247626208 volume_direction_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7

Surprise_change is best at predicting cum_abnormal_return_next_day and avg_abnormal_return_next_day. Sentimentr_lm and sentimentr_lm_change lead in terms of predictive power for all other financial indicators. Note, however, that none of these variables are statistically significant according to the regression results. For detailed results of the regression analysis see the Appendix.

control <- c("cik", "GICS_Sub_Industry", "debt_equity_ratio", "ma7")
control <- paste(control, collapse = '+')
sentiment_models <- list()
dv_names_binary <- c("movment_direction")

iv_names <- c("sentiment_nrc", 
             "sentiment_bing", 
             "sentiment_afinn", 
             "SentimentGI", 
             "SentimentHE", 
             "SentimentLM", 
             "SentimentQDAP", 
             "RatioUncertaintyLM", 
             "sentimentr", 
             "NegativityGI", 
             "PositivityGI", 
             "NegativityHE", 
             "PositivityHE", 
             "PositivityLM", 
             "NegativityLM", 
             "NegativityQDAP", 
             "PositivityQDAP",
             "sentiment_nrc_change", 
             "sentiment_bing_change", 
             "sentiment_afinn_change", 
             "SentimentGI_change", 
             "SentimentHE_change", 
             "SentimentLM_change", 
             "SentimentQDAP_change", 
             "RatioUncertaintyLM_change", 
             "sentimentr_change", 
             "NegativityGI_change", 
             "PositivityGI_change", 
             "NegativityHE_change", 
             "PositivityHE_change", 
             "PositivityLM_change", 
             "NegativityLM_change", 
             "NegativityQDAP_change", 
             "PositivityQDAP_change")

for (y in dv_names_binary){
  for (x in iv_names) {
    x <- paste(x, control, sep = "+")
    form <- formula(paste(y, "~", x))
    sentiment_models[[y]][[x]] <- glm(form, data=data_for_regression, family = "binomial") 
  }
}

name <- "test"

for (y in sentiment_models) {
  name <- combURL("PartB", paste0(all.vars(formula(y[[1]]))[1], "-", name, ".tex"))
  stargazer::stargazer(y, type = "latex", out = name)
}

Part C: Topic Modeling

A topic is a bag of words where each word is assigned a probability of belonging to the topic. A document consists of multiple topics of varying proportions.
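These two layers can be inspected directly once the stm model has been fitted (the object called model later in this part); a minimal sketch using tidytext's tidiers for stm objects (the gamma form is also used further below):

# per-topic word probabilities, i.e. the "bag of words" for each topic
tidy(model, matrix = "beta")
# per-document topic proportions
tidy(model, matrix = "gamma")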

In this part of the study we use Structural Topic Modeling (STM) to discover topics in the textual data. The main advantage of STM over LDA or CTM is that STM allows users to include document metadata in the model (Roberts, 2016). Topic prevalence and topic content in a document can thus be associated with metadata.

The topics discussed by management depend on the industry the company is in, as was clearly illustrated by the tf-idf token exploration in Part A. The time at which a report was written is also likely to affect the topics mentioned; for instance, in 2020 reports we expect topics related to hygiene to be more prevalent. Therefore, sub-industry and temporal variables are included as STM model covariates. Primarily, we want to examine the relationship between the return on a stock and the topics discussed, so cumulative abnormal returns are added as a prevalence covariate. Additionally, company features such as the price-to-earnings ratio and the debt-to-equity ratio are included, since they were found to be good predictors of abnormal returns; for instance, it is possible that the management of a company with a higher debt-to-equity ratio will focus on debt in the management discussion section of the report. Technical indicators are not included since they only reflect the short-term nature of a company's price movements rather than some intrinsic company state, so there is no plausible mechanism by which they could affect the topics discussed.

According to Stewart (2020), prevalence covariates are not particularly sensitive to the number of metadata variables used, whereas content covariates are. Therefore, whilst adding a number of prevalence covariates appears reasonable, adding numerous content covariates does not. Moreover, adding content covariates removes the ability to carry out Search K. Since selecting K is perhaps the most important decision in this type of analysis, no content covariates are added.

Spectral initialization is recommended by Roberts et al. (2016): it outperforms LDA and random initialization, and it returns consistent topics by focusing on anchor words (Mourtgos & Adams, 2019). According to the stm documentation, a rough guess of the optimal number of topics is up to 50 for a corpus of a few hundred documents. Therefore, Search K in this study is carried out for K between 2 and 60.

The stm output gives, for each topic, words ranked by probability, FREX, lift and score. To label topics the main focus is on the FREX measure, which weighs how frequently a word appears in a topic against how exclusive it is to that topic. A conceptual parallel to tf-idf can be drawn here.
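A minimal sketch of pulling these rankings from a fitted model (referring to the model object created later in this part); frexweight balances frequency against exclusivity and defaults to 0.5:

# top words per topic under the Highest Prob, FREX, Lift and Score rankings
labelTopics(model, n = 7, frexweight = 0.5)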

Data Preparation

meta <- master %>% 
  select(cik, GICS_Sub_Industry, documents_pos_tagged, date) %>% 
  mutate(year = year(date), cik = factor(cik), GICS_Sub_Industry = factor(GICS_Sub_Industry))  

data_for_stm <- data_for_regression %>% 
  select(cum_abnormal_return_next_week, cum_abnormal_return_next_day, debt_equity_ratio, pe_ratio, cik,  year) %>% 
  inner_join(meta, by=c("cik", "year")) %>% select(-cik, -date) 

rm(data_for_regression)

processed <- textProcessor(data_for_stm$documents_pos_tagged,
                           metadata = data_for_stm,
                           stem = F)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Creating Output...
threshold <- round(1/100 * length(processed$documents),0)

out <- prepDocuments(processed$documents,
                     processed$vocab,
                     processed$meta,
                     lower.thresh = threshold)
## Removing 64 of 3169 terms (166 of 203687 tokens) due to frequency 
## Your corpus now has 326 documents, 3105 terms and 203521 tokens.

Search K

k_values = seq(from=2,to=60,by=2)
search_k_results <- searchK(out$documents, out$vocab, K=k_values,
                            N = floor(0.1*length(out$documents)),
                            prevalence = ~cum_abnormal_return_next_day + 
                              factor(cik) + 
                              s(year) + 
                              GICS_Sub_Industry + 
                              pe_ratio + 
                              debt_equity_ratio,
                            cores = 2,
                            data=out$meta)
k_values = seq(from=2,to=12,by=1)
search_k_results_deep <- searchK(out$documents, out$vocab, K=k_values,
                            N = floor(0.1*length(out$documents)),
                            prevalence = ~cum_abnormal_return_next_day + 
                              s(year) + 
                              factor(cik) + 
                              GICS_Sub_Industry + 
                              pe_ratio + 
                              debt_equity_ratio,
                            cores = 2,
                            data=out$meta)
k_values = seq(from=8,to=20,by=1)
search_k_results_deep <- searchK(out$documents, out$vocab, K=k_values,
                            N = floor(0.1*length(out$documents)),
                            prevalence = ~cum_abnormal_return_next_day + 
                              s(year) + 
                              factor(cik) + 
                              GICS_Sub_Industry + 
                              pe_ratio + 
                              debt_equity_ratio,
                            cores = 2,
                            data=out$meta)
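The diagnostic curves shown below can be produced with stm's plot method for searchK objects; a minimal sketch using the objects created above:

plot(search_k_results)      # held-out likelihood, residuals, semantic coherence, lower bound
plot(search_k_results_deep)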
Search K diagnostic plots
Unsurprisingly, semantic coherence is very high when few topics are present; this is due to the statistical properties of the technique (Mimno, 2011; Roberts, 2014). Thus, peaks beyond the initial high values should be looked at. The levelling off of the held-out likelihood and lower bound curves, as well as the trough in the residuals and the peak in semantic coherence, all suggest an optimal value of K equal to 9.

optimal_k <- 9

Model Tuning

optimal_k_models <- selectModel(documents = out$documents, 
                                vocab = out$vocab,
                                K =  optimal_k,
                                prevalence = ~cum_abnormal_return_next_day + 
                                  s(year) + 
                                  GICS_Sub_Industry + 
                                  pe_ratio + 
                                  debt_equity_ratio,
                                max.em.its = 150,
                                gamma.prior='L1',
                                data = out$meta,
                                init.type = "Spectral", 
                                ngroups = 5)
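The comparison below can be drawn with stm's plotModels(), which plots each run's exclusivity against its semantic coherence; a minimal sketch using the selectModel output above:

plotModels(optimal_k_models)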
Comparing Models

For most topics the models have almost the same values for semantic coherence and exclusivity. However, for one of the topics, models 2 and 3 outperform all others in terms of semantic coherence by a substantial margin. Model 2 is selected for further analysis as it performs better on exclusivity in some of the other cases.

model <- optimal_k_models$runout[[2]]
save(model, file="PartC/stm_optimal_model.rda")

Results

tidy_summary <- data.frame(FREX = do.call(paste, 
                                          c(as.data.frame(summary(model)$frex), sep=", ")),
                           Lift = do.call(paste, 
                                          c(as.data.frame(summary(model)$lift), sep=", ")),
                           Score = do.call(paste, 
                                           c(as.data.frame(summary(model)$score), sep=", ")),
                           Prob = do.call(paste, 
                                          c(as.data.frame(summary(model)$prob), sep=", "))) %>% 
  mutate(topic = row_number()) %>% 
  select(topic, everything()) 
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
labels <- c("Lawsuits", "Operationw Abroad" ,  "Software & Hardware", 
            "Accounting Standards", "Distribution and Manufacturing",
            "Insurance", "Digital Royalty", "Card Rewards", "Subscription")
tidy_summary[,2:5] %>% 
  knitr::kable(booktabs = TRUE) %>% 
  pack_rows(index = paste("Topic", c(1:9), ":", labels)) %>% 
  kable_styling(latex_options = "scale_down")
FREX Lift Score Prob
Topic 1 : Lawsuits
class, litigation, court, escrow, complaint, interchange, plaintiff accountants, covered, disagreements, circumvention, overriding, benefits, signatory accountants, class, plaintiff, complaint, escrow, covered, defendant company, asset, share, income, class, common, settlement
Topic 2 : Operations Abroad
consumer, transfer, agent, region, negatively, versus, euro reside, arbitrator, constructive, expectancy, generator, illicit, impediment lines, consumer, border, check, segment, processing, agent rate, revenue, income, currency, foreign, expense, business
Topic 3 : Software & Hardware
hardware, percent, support, software, update, system, premise decreasing, deflationary, elements, optimistic, picture, post-combination, quantified hardware, deflationary, support, subscription, software, middleware, license revenue, service, software, expense, product, customer, support
Topic 4 : Accounting Standards
non-gaap, investor, mutual, company, proxy, communication, measure academia, accountable, forecasts, imprecise, reaction, sub-section, biomedical company, chemistry, non-gaap, proxy, earnings, perpetual, investor company, revenue, income, expense, related, result, rate
Topic 5 : Distribution and Manufacturing
client, payroll, insurance, fund, processing, online, solution intermediate, quicken, calculate, centralize, embezzlement, exhaustive, garnishment client, centralize, payroll, worker, segment, processing, service revenue, service, income, client, rate, total, business
Topic 6 : Insurance
wafer, distributor, inventory, fabrication, memory, manufacturing, shipment advantages, averse, bifurcation, constantly, controllers, dense, enthusiast wafer, distributor, fabrication, inventory, memory, gigabit, semiconductor product, income, expense, rate, primarily, market, result
Topic 7 : Digital Royalty
digital, royalty, device, creative, media, wireless, circuit authoring, acrobat, advertiser, cost-sensitive, foregone, hobbyist, localization risky, subscription, wireless, creative, circuit, modem, acrobat revenue, related, primarily, income, product, expense, increase
Topic 8 : Card Rewards
fuel, mile, label, reward, private, spread, redemption accessory, accordion, apparel, bankrupt, branded, coalition, collector fuel, mile, label, conduit, reward, fleet, grocery credit, revenue, rate, increase, income, expense, transaction
Topic 9 : Subscription
maintenance, subscription, observable, billing, professional, input, privately ample, convention, correct, diligent, drawdown, forint, freight subscription, maintenance, forint, hardware, perpetual, upfront, seat revenue, product, cost, increase, service, expense, asset
tidy_gamma <- tidy(model, matrix = "gamma", document_names = rownames(out$meta))

tidy_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  mutate(topic = reorder(topic, gamma)) %>% 
  ggplot(aes(topic, gamma, label = labels, fill = topic)) +
  geom_col(show.legend = FALSE, alpha = 0.8) +
  geom_text(hjust = 1.2, nudge_y = 0.0005, size = 10, color='white') +
  coord_flip() +
  theme_light(base_size = 22)  +
  labs(x = NULL, y = expression(gamma),
       title = "Top topics by prevalence in 10-K reports")

ggsave("PartC/topic_proportions_in_corpus.png", 
       width = 50, height = 35, units = "cm")
Topic Proportions across Industries

tidy_gamma  %>% 
  pivot_wider(id_cols=document, names_from = topic, values_from = gamma) %>% 
  cbind(meta) %>% 
  select(-documents_pos_tagged, -year, -cik) %>% 
  group_by(GICS_Sub_Industry) %>% 
  summarise(across(where(is.numeric), mean)) %>% 
  pivot_longer(!GICS_Sub_Industry, names_to = "topic", values_to = "gamma") %>% 
  mutate(topic = factor(topic)) %>% 
  ggplot() + 
  geom_bar(aes(x = topic, y = gamma, fill =topic), alpha = 0.8, stat = "identity") + 
  facet_wrap(.~GICS_Sub_Industry) + 
  theme_light(base_size = 22) + 
  theme(strip.background=element_blank(), 
        strip.text=element_text(colour = 'black', face = "bold", size = 17)) + 
  xlab("Topic") + ylab("Mean Gamma") + 
  scale_fill_discrete(name="Legend",labels=labels)

ggsave("PartC/topic_proportions_across_industries.png", 
       width = 50, height = 35, units = "cm")

In this graph the variation in topic proportions across industries can be seen.


Document distributions across topics

#exclude all docs with prob less than 1 percent
#allows for better examination
ggplot(tidy_gamma %>% filter(gamma > 0.01), 
       aes(gamma, fill = as.factor(topic))) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, ncol = 3) +
  labs(title = "Document probabilities distribution per topic",
       y = "Number of reports", x = expression(gamma)) +
  theme_light(base_size = 22)

ggsave("PartC/document_probabilities_distribution.png",
       width = 45, height = 45, units = "cm")

Each topic is strongly associated with some of the documents and less so with others.

effects_return <- estimateEffect(1:optimal_k ~debt_equity_ratio, stmobj = model, meta = out$meta)

margin1 <- as.numeric(quantile(out$meta$debt_equity_ratio)[2])
margin2 <- as.numeric(quantile(out$meta$debt_equity_ratio)[4])

plot(effects_return, covariate = "debt_equity_ratio",
     topics = 1:optimal_k,
     model = model, method = "difference",
     cov.value1 = margin2, cov.value2 = margin1,
     xlab = "Low Debt ... High Debt",
     xlim = c(-0.01,0.01),
     main = "Marginal change on topic probabilities for low and high price",
     custom.labels = labels,
     ci.level = 0.05,
     labeltype = "custom")

effects_return <- estimateEffect(1:optimal_k ~pe_ratio, stmobj = model, meta = out$meta)

margin1 <- as.numeric(quantile(out$meta$pe_ratio)[2])
margin2 <- as.numeric(quantile(out$meta$pe_ratio)[4])

plot(effects_return, covariate = "pe_ratio",
     topics = 1:optimal_k,
     model = model, method = "difference",
     cov.value1 = margin2, cov.value2 = margin1,
     xlab = "Low PE ... High PE",
     xlim = c(-0.01,0.01),
     main = "Marginal change on topic probabilities for low and high PE ratio",
     custom.labels = labels,
     ci.level = 0.05,
     labeltype = "custom")

effects_return <- estimateEffect(1:optimal_k ~cum_abnormal_return_next_day, stmobj = model, meta = out$meta)

margin1 <- as.numeric(quantile(out$meta$cum_abnormal_return_next_day)[2])
margin2 <- as.numeric(quantile(out$meta$cum_abnormal_return_next_day)[4])

plot(effects_return, covariate = "cum_abnormal_return_next_day",
     topics = 1:optimal_k,
     model = model, method = "difference",
     cov.value1 = margin2, cov.value2 = margin1,
     xlab = "Low Price ... High Price",
     xlim = c(-0.05,0.05),
     main = "Marginal change on topic probabilities for low and high price",
     custom.labels = labels,
     ci.level = 0.05,
     labeltype = "custom")

effects_year <- estimateEffect(1:optimal_k ~s(year), stmobj = model, meta = out$meta)

plot(effects_year, covariate = "year",
     topics = 1:optimal_k,
     model = model, method = "continuous",
     xlab = "Past ... Present",
     main = "Marginal change on topic probabilities across years",
     custom.labels =labels,
     ci.level = 0.05,
     labeltype = "custom")

topic_correlations <- topicCorr(model) 
plot.topicCorr(topic_correlations,
               vlabels = labels,
               vertex.color = "#CDF0EA", 
               vertex.label.cex = 1, 
               vertex.size=30, 
               vertex.label.color="#053742")

These charts depict the marginal change in topic prevalence as each covariate changes.
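Numeric counterparts to these plots are available through the summary method for estimateEffect objects; a minimal sketch reusing the objects created above:

summary(effects_year, topics = 1)    # regression table for topic 1 on the year spline
summary(effects_return, topics = 1)  # regression table for topic 1 on abnormal returns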

Regression

tidy_theta <- as.data.frame(model$theta)
colnames(tidy_theta) <- paste0("topic_",1:9)
tidy_theta <- cbind(out$meta,tidy_theta)
topics <- paste0("topic_", 1:9)
iv_names <- paste(c(topics , "cik"), collapse = " + ")
lm_model <- lm(paste("cum_abnormal_return_next_day ~", iv_names), tidy_theta)
tidy(lm_model) %>% kable() %>%  
  kable_styling(position = "center")
term estimate std.error statistic p.value
(Intercept) -0.0045039 0.0093844 -0.4799320 0.6316400
topic_1 -0.0146199 0.0095478 -1.5312351 0.1268090
topic_2 0.0064744 0.0068529 0.9447689 0.3455688
topic_3 0.0070618 0.0082127 0.8598637 0.3905794
topic_4 0.0112073 0.0087023 1.2878567 0.1988297
topic_5 -0.0081481 0.0079406 -1.0261292 0.3056918
topic_6 -0.0000164 0.0058805 -0.0027942 0.9977724
topic_7 0.0024948 0.0076385 0.3266045 0.7442043
topic_8 0.0037274 0.0079790 0.4671466 0.6407482
topic_9 NA NA NA NA
cik4127 0.0025809 0.0116364 0.2217923 0.8246328
cik6281 0.0066921 0.0116626 0.5738114 0.5665433
cik723125 0.0079608 0.0115370 0.6900222 0.4907359
cik723531 -0.0002769 0.0123163 -0.0224848 0.9820768
cik743316 -0.0024732 0.0117525 -0.2104429 0.8334708
cik743988 0.0052937 0.0115885 0.4568068 0.6481543
cik769397 0.0162905 0.0116651 1.3965126 0.1636355
cik779152 0.0046571 0.0116263 0.4005662 0.6890365
cik796343 -0.0005136 0.0116500 -0.0440898 0.9648633
cik798354 0.0077926 0.0116638 0.6681061 0.5046009
cik804328 0.0012401 0.0116892 0.1060883 0.9155861
cik813672 0.0042170 0.0115978 0.3636017 0.7164222
cik827054 0.0068285 0.0115704 0.5901682 0.5555406
cik849399 -0.0005927 0.0115451 -0.0513378 0.9590919
cik877890 0.0086757 0.0116911 0.7420753 0.4586465
cik883241 0.0059439 0.0115503 0.5146102 0.6072201
cik896878 0.0052314 0.0116270 0.4499312 0.6530985
cik1013462 -0.0076748 0.0116263 -0.6601238 0.5097020
cik1045810 -0.0085638 0.0116545 -0.7348037 0.4630569
cik1101215 -0.0043800 0.0116667 -0.3754310 0.7076163
cik1108524 0.0070212 0.0116741 0.6014353 0.5480232
cik1123360 -0.0122103 0.0118764 -1.0281202 0.3047560
cik1136893 0.0076943 0.0115983 0.6634022 0.5076036
cik1141391 0.0034293 0.0116159 0.2952265 0.7680335
cik1175454 0.0043529 0.0119242 0.3650481 0.7153434
cik1341439 0.0146612 0.0115752 1.2666097 0.2063183
cik1365135 0.0033617 0.0117230 0.2867624 0.7745004
cik1383312 0.0322571 0.0116135 2.7775440 0.0058367
cik1403161 -0.0001431 0.0115294 -0.0124122 0.9901054
glance(lm_model) %>% kable() %>% 
  kable_styling(position = "center")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.1124705 -0.0015524 0.0269775 0.9863856 0.4966375 37 735.3848 -1392.77 -1245.081 0.2096024 288 326

None of the topics are useful in predicting abnormal returns. Note that topic_9 is reported as NA because the topic proportions within each document sum to one, so the final topic is perfectly collinear with the intercept and the other topic columns and is dropped by lm.
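A minimal sketch of the same regression with the redundant column dropped explicitly (lm_model_drop9 and iv_names_drop9 are hypothetical names not used elsewhere in this report):

# drop topic_9 so that each coefficient measures shifting proportion away from topic_9
iv_names_drop9 <- paste(c(paste0("topic_", 1:8), "cik"), collapse = " + ")
lm_model_drop9 <- lm(paste("cum_abnormal_return_next_day ~", iv_names_drop9), tidy_theta)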

Unsupervised Model

auto_stm_model <- stm(documents = out$documents, 
                      vocab = out$vocab,
                      K = 0,
                      prevalence =~cum_abnormal_return_next_day + s(year) + 
                        factor(cik) + 
                        GICS_Sub_Industry + 
                        pe_ratio + 
                        debt_equity_ratio, 
                      max.em.its = 150,
                      gamma.prior='L1',
                      data = out$meta,
                      init.type = "Spectral", 
                      ngroups = 5)
save(auto_stm_model, file="PartC/auto_stm_model.rda")
load("PartC/auto_stm_model.rda")
tidy_summary <- data.frame(FREX = do.call(paste, c(as.data.frame(summary(auto_stm_model)$frex), sep=", ")),
                           Lift = do.call(paste, c(as.data.frame(summary(auto_stm_model)$lift), sep=", ")),
                           Score = do.call(paste, c(as.data.frame(summary(auto_stm_model)$score), sep=", ")),
                           Prob = do.call(paste, c(as.data.frame(summary(auto_stm_model)$prob), sep=", "))) %>% 
  mutate(topic = row_number()) %>% 
  select(topic, everything()) 
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
tidy_summary[,2:5] %>% 
  knitr::kable(booktabs = TRUE) %>% 
  pack_rows(index = paste("Topic", c(1:55))) %>% 
  kable_styling(latex_options = "scale_down")
FREX Lift Score Prob
Topic 1
executive, disclosure, control, decision, resource, addition, accounting accountants, control, disclosure, executive, decision, officer, resource accountants, control, executive, decision, disclosure, officer, resource control, disclosure, executive, decision, addition, accounting, management
Topic 2
class, court, plaintiff, complaint, defendant, benefits, motion benefits, preliminarily, investigative, declaratory, settled, co-defendant, unlawful benefits, class, plaintiff, complaint, defendant, escrow, court company, asset, class, share, common, income, note
Topic 3
client, payroll, insurance, fund, administration, worker, associate centralize, topics, ancillary, attendance, paid, intermediate, garnishment client, payroll, insurance, check, worker, centralize, remittance client, service, investment, rate, income, fund, revenue
Topic 4
limitation, procedure, circumvention, effectiveness, inherent, possibility, constraint circumvention, procedure, accountant, constraint, error, misstatement, possibility circumvention, procedure, effectiveness, absolute, error, constraint, control control, reporting, reasonable, procedure, internal, limitation, effectiveness
Topic 5
mutual, proxy, communication, earnings, advisor, consisting, investor closed, nontaxable, post-retirement, newsletter, registrar, contests, non-trade proxy, mutual, earnings, closed, advisor, non-gaap, investor revenue, increase, company, expense, income, rate, earning
Topic 6
accompanying, merger, check, line, unfavorable, trademark, network combat, pronouncements, lines, restrictive, commits, incoming, membership combat, check, accompanying, merger, vertical, trademark, membership income, credit, revenue, asset, rate, period, expense
Topic 7
update, hardware, support, software, comparison, education, license computationally, annualize, arranger, post-combination, middleware, shortly, codification hardware, middleware, update, computationally, software, education, support revenue, product, software, expense, hardware, support, rate
Topic 8
mutual, earnings, proxy, outsourcing, distribution, communication, broker correspondent, weights, non-compliance, securities, archival, piece, midrange correspondent, earnings, proxy, mutual, outsourcing, broker, client revenue, company, operation, service, income, agreement, increase
Topic 9
description, emulation, warrant, maintenance, hardware, restructuring, conversion cryptocurrency, convention, industrialized, product-specific, blueprint, logical, contemplation hardware, emulation, maintenance, cryptocurrency, description, warrant, conversion product, revenue, change, note, cost, related, asset
Topic 10
class, pension, nominal, claim, sustainable, convert, differential defendants, honor, panel, violate, complaints, pre-trial, omnibus defendants, class, interchange, escrow, pension, client, litigation company, income, asset, share, class, common, note
Topic 11
implementation, standard, standalone, stream, practical, delivery, outsourcing deflationary, randomly, multi-element, organizations, non-essential, invested, survey deflationary, survey, outsourcing, processing, hardware, remittance, maintenance revenue, service, cost, customer, related, company, product
Topic 12
microprocessor, graphics, amendment, indenture, chipset, processor, shipment averse, derivation, enthusiast, freedom, meaningfully, dense, semi-custom microprocessor, graphics, wafer, dense, shipment, semi-custom, chipset expense, related, primarily, amount, product, income, decrease
Topic 13
system, independent, report, management, event, effective, accounting disagreements, system, independent, public, management, report, control system, disagreements, independent, report, public, control, internal system, independent, management, report, future, accounting, effective
Topic 14
distributor, signal, assembly, microcontroller, debenture, semiconductor, capacity distributors, inappropriate, offshore, rare, interface, uncommon, signal distributor, microcontroller, wafer, debenture, semiconductor, assembly, memory product, distributor, approximately, income, cost, acquisition, amount
Topic 15
insurance, client, payroll, worker, worksite, fund, administration embezzlement, flex, worksite, usual, facing, renovation, overnight client, insurance, embezzlement, worksite, worker, payroll, non-gaap client, service, income, rate, investment, fund, share
Topic 16
return, allowances, pre-tax, research, title, derivative, contingent breakout, end-customer, exclusivity, indicators, non-warranty, differences, post-shipment exclusivity, inventory, distributor, wafer, non-warranty, research, shipment income, asset, revenue, expense, primarily, rate, product
Topic 17
maintenance, perpetual, license, upfront, professional, chip, criterion ample, hotline, drawdown, forint, fronts, mentioned, post-customer maintenance, perpetual, upfront, hardware, shipment, chip, functionally revenue, license, increase, customer, term, service, cost
Topic 18
fuel, wholesale, organic, fleet, spread, macroeconomic, network toll, undivided, acceptability, diagram, expansive, fleets, gallon fuel, wholesale, fleet, organic, transportation, spread, gallon revenue, income, transaction, rate, fuel, facility, impact
Topic 19
mile, reward, label, database, private, cardholder, redemption grocery, fashion, woman, furnishings, email, trusts, apparel mile, reward, label, cardholder, database, breakage, sponsor credit, increase, rate, service, revenue, expense, mile
Topic 20
seat, geography, non-gaap, subscription, reseller, maintenance, suite curricula, deploy, digitally, disciplinary, downloadable, educator, expositions subscription, non-gaap, seat, maintenance, horizontal, geography, reseller revenue, product, increase, expense, business, cost, primarily
Topic 21
consumer, agent, transfer, location, region, rating, paper intra-country, re-balancing, uncertainties, imposing, migrant, constructive, consumers intra-country, consumer, agent, border, rating, transfer, region revenue, rate, business, transaction, consumer, foreign, income
Topic 22
non-gaap, simulation, perpetual, maintenance, operational, lease, investor sub-section, academia, accountable, biomedical, chemical, chemistry, copyright non-gaap, maintenance, perpetual, company, simulation, investor, matrix company, revenue, income, expense, related, result, rate
Topic 23
label, private, mile, reward, redemption, sponsor, collector merchandise, harbor, catalog, label, year, collector, permission mile, label, reward, private, collector, breakage, merchandise credit, service, revenue, rate, mile, private, reward
Topic 24
circuit, wireless, device, royalty, spectrum, marketable, patent multimode, -process, circuits, codec, laptops, messaging, nonfunctional wireless, circuit, device, multimode, licensee, royalty, marketable revenue, related, rate, product, primarily, increase, asset
Topic 25
communication, proxy, wealth, mutual, investor, retirement, earnings multi-asset, borrowed, wealth, post-employment, mailing, sell, clearance multi-asset, wealth, proxy, earnings, mutual, non-gaap, investor company, revenue, service, management, income, activity, asset
Topic 26
percent, subscription, professional, invoice, renewal, billing, non-gaap co-location, contributor, motivate, multi-tenant, undeveloped, impending, parking subscription, non-gaap, percent, absolute, invoice, multi-tenant, billing revenue, service, expense, customer, total, percent, increase
Topic 27
border, rebate, euro, versus, issuer, local, incentive nonstandard, numerical, reconcile, non-european, encouraging, neutral, warning neutral, border, rebate, euro, versus, cardholder, non-gaap currency, expense, revenue, foreign, income, rate, customer
Topic 28
premise, hardware, license, index, infrastructure, support, swap non-oracle, summation, interoperable, interest, host, firmware, upward hardware, premise, non-oracle, index, swap, marketable, deployment revenue, license, service, expense, rate, currency, hardware
Topic 29
debit, guarantee, check, intrusion, ticket, channel, associations ticket, non-routine, resell, restaurants, intrusion, associations, dishonored check, ticket, debit, intrusion, non-routine, associations, interchange service, credit, facility, rate, income, revenue, loss
Topic 30
banking, check, consolidation, swap, incremental, variance, processing corrupt, disrupt, non-traditional, sophistication, steal, surviving, detrimental check, banking, client, earnings, processing, swap, non-traditional revenue, service, rate, operation, income, period, business
Topic 31
comprehensive, damage, court, objection, derivative, liabilities, incentive objection, unadjusted, argument, appellate, alleged, intra-entity, allegedly objection, interchange, complaint, damage, plaintiff, class, court asset, income, company, loss, share, liability, foreign
Topic 32
overriding, internal, reporting, procedure, effectiveness, control, misstatement overriding, inadequate, misstatement, detection, authorizations, procedure, fairly overriding, procedure, misstatement, reporting, control, effectiveness, internal control, internal, reporting, management, share, procedure, acquisition
Topic 33
storage, content, server, authoritative, joint, indemnification, undelivered controversy, authentication, liberal, outweigh, partnering, taxpaying, tolling subscription, storage, partnering, authoritative, server, rebate, content revenue, asset, amount, income, cost, expense, primarily
Topic 34
profit, reserve, material, military, equivalent, research, uncertainty amplifier, precision, revolution, smartphones, multitude, pascal, prohibitive pascal, military, inventory, cellular, research, medical, erosion expense, related, revenue, rate, income, result, asset
Topic 35
desktop, investments, online, segment, staffing, unsecured, payroll exhaustive, nimbly, patient, quicken, suspicious, employ, pre-established desktop, patient, quicken, staffing, segment, payroll, online revenue, income, service, business, expense, segment, total
Topic 36
creative, developer, redundant, stable, prepayment, termination, restructuring shippable, foregone, perpetually, hobbyist, redundant, download, localization creative, perpetually, acrobat, redundant, developer, subscription, element revenue, product, related, primarily, income, expense, cost
Topic 37
digital, media, subscription, document, creative, backlog, offering personalize, cost-sensitive, subscribe, syncing, advertiser, trajectory, photography subscription, media, digital, creative, document, perpetual, personalize revenue, increase, income, digital, primarily, foreign, subscription
Topic 38
implementation, outsourced, hardware, complementary, support, electronic, element picture, decreasing, optimistic, unwavering, non-exclusive, aircraft, bullet hardware, outsourced, outsourcing, element, picture, installation, complementary revenue, service, cost, customer, support, product, software
Topic 39
hardware, update, software, premise, support, comparison, subscription protect, -aservice, elements, instructor, perfunctory, rational, agility hardware, subscription, update, software, premise, support, storage software, revenue, product, hardware, support, service, expense
Topic 40
distributor, microcontroller, assembly, half, auction, capacity, debenture purely, serial, proposal, fail, pre-determined, non-proprietary, unrelated distributor, microcontroller, purely, wafer, debenture, memory, inventory product, distributor, market, income, result, investment, approximately
Topic 41
restructure, realizable, dram, volume, decline, equipment, manufacture re-use, restructure, forecasting, outpace, rolling, qualification, non-trade restructure, dram, outpace, realizable, memory, qualification, inventories product, primarily, cost, amount, income, increase, rate
Topic 42
unsecured, contingent, special, action, employment, demand, distributor instrumentation, roadmaps, tenor, sizing, sold, injury, predominant distributor, inventory, predominant, roadmaps, industrial, categorization, shipment result, income, rate, increase, product, revenue, amount
Topic 43
consumer, transfer, negatively, agent, region, paper, strengthening arbitrator, interconnected, multi-strategy, varied, expectancy, saving, illicit consumer, saving, agent, strengthening, peso, pension, region rate, revenue, currency, foreign, income, consumer, business
Topic 44
division, input, workspace, observable, authoritative, collaboration, virtualization duplication, login, mitigate, observation, password, shrink, turnaround workspace, division, subscription, authoritative, desktop, maintenance, virtualization product, revenue, service, related, primarily, asset, cost
Topic 45
covered, retrospective, litigation, responsibility, escrow, settlement, interchange sponsoring, covered, misstatements, responsibility, retrospective, escrow, non-controll covered, sponsoring, responsibility, litigation, escrow, interchange, retrospective litigation, covered, retrospective, note, responsibility, settlement, provision
Topic 46
transition, provisional, quantitative, pandemic, adoption, distinct, form stockholders, coronavirus, unsatisfied, pandemic, shutdown, non-distributor, capable stockholders, provisional, pandemic, distinct, perpetual, transition, enactment income, change, obligation, amount, rate, time, transition
Topic 47
company, report, presentation, management, statement, event, assumption supervision, company, presentation, reclassification, principle, public, report supervision, company, translation, report, indefinite, audit, public company, management, statement, report, future, acquisition, accounting
Topic 48
notebook, architecture, processor, workstation, game, marketable, warranty builder, custodian, motherboard, multi-core, municipality, navigation, recall inventory, notebook, marketable, rebate, processor, graphics, visual product, revenue, income, market, cost, related, expense
Topic 49
processing, senior, institution, unconsolidated, subscriber, banking, electronic acumen, distinction, sizable, accompany, convey, mild, perception client, processing, transit, unconsolidated, subscriber, banking, thrift revenue, service, rate, income, payment, business, expense
Topic 50
gigabit, dram, memory, venture, flash, production, joint underutilized, severe, tech, multi-chip, density, verdict, successively gigabit, underutilized, dram, memory, flash, wafer, tech product, cost, primarily, result, average, expense, acquisition
Topic 51
assembly, microcontroller, distributor, signal, fabrication, wafer, semiconductor uninsured, gate, virus, dispersion, expirations, wider, adopt distributor, microcontroller, uninsured, wafer, assembly, fabrication, semiconductor product, rate, acquisition, distributor, customer, facility, result
Topic 52
redemption, loyalty, conduit, deposit, mile, consent, offs unredeemed, prevailing, expiry, eliminations, restitution, non-executive, regression mile, unredeemed, loyalty, conduit, reward, expiry, redemption credit, increase, rate, expense, revenue, asset, program
Topic 53
gigabit, dram, supply, flash, memory, output, production width, creditor, gigabits, gigabit, successively, yuan, wind gigabit, dram, memory, wafer, flash, fabrication, width product, cost, result, primarily, agreement, amount, average
Topic 54
comparable, debenture, broadcast, family, shipment, distributor, mainstream foreign-currency, withhold, prom, lengthy, published, salable, wireline debenture, distributor, broadcast, shipment, inventory, wireless, mainstream revenue, product, income, period, rate, market, increase
Topic 55
divestiture, unallocated, identity, protection, enterprise, billing, transition writing, divestiture, identity, unallocated, non-operat, exceptions, varying divestiture, writing, identity, unallocated, billing, protection, metric revenue, primarily, income, expense, result, cost, operation
library(broom) # for tidy() and glance() model summaries

# Document-level topic proportions (theta) from the unsupervised stm model
tidy_theta <- as.data.frame(auto_stm_model$theta)
colnames(tidy_theta) <- paste0("topic_", 1:55)
tidy_theta <- cbind(out$meta, tidy_theta)

# Regress next-day cumulative abnormal returns on topic proportions with
# company (cik) fixed effects. The 55 proportions sum to one, so lm() drops
# one of them (topic_55 appears as NA in the coefficient table below).
topics <- paste0("topic_", 1:55)
iv_names <- paste(c(topics, "cik"), collapse = " + ")
lm_model <- lm(paste("cum_abnormal_return_next_day ~", iv_names), tidy_theta)

tidy(lm_model) %>% kable() %>%
  kable_styling(position = "center")
term estimate std.error statistic p.value
(Intercept) -0.0342238 0.0174691 -1.9591011 0.0512492
topic_1 -0.1601652 0.9372345 -0.1708913 0.8644521
topic_2 0.0161565 0.0270823 0.5965688 0.5513530
topic_3 0.0283555 0.0201120 1.4098833 0.1598580
topic_4 0.1070051 0.1016611 1.0525676 0.2935890
topic_5 0.0349806 0.0230135 1.5200022 0.1298158
topic_6 0.0164919 0.0216484 0.7618089 0.4469159
topic_7 0.0260029 0.0236328 1.1002862 0.2723007
topic_8 0.0494177 0.0206808 2.3895485 0.0176355
topic_9 0.0255111 0.0182664 1.3966191 0.1638077
topic_10 0.0571828 0.0258001 2.2163796 0.0275961
topic_11 0.0660352 0.0213440 3.0938550 0.0022079
topic_12 0.0261464 0.0176575 1.4807478 0.1399745
topic_13 -0.6844999 1.1688602 -0.5856132 0.5586812
topic_14 0.0236586 0.0208292 1.1358359 0.2571489
topic_15 0.1090007 0.0244369 4.4605055 0.0000125
topic_16 0.0316088 0.0186996 1.6903499 0.0922486
topic_17 0.0282858 0.0173990 1.6257179 0.1053112
topic_18 0.0365042 0.0174952 2.0865307 0.0379777
topic_19 -0.0001363 0.0226825 -0.0060069 0.9952121
topic_20 0.0376618 0.0174977 2.1523796 0.0323567
topic_21 0.0335991 0.0217613 1.5439838 0.1238989
topic_22 0.0380281 0.0173324 2.1940433 0.0291836
topic_23 0.0168625 0.0234769 0.7182595 0.4732902
topic_24 0.0348916 0.0169743 2.0555549 0.0408986
topic_25 0.0273443 0.0253257 1.0797059 0.2813481
topic_26 0.0312590 0.0176186 1.7742063 0.0772863
topic_27 0.0209410 0.0184292 1.1362966 0.2569564
topic_28 0.0677562 0.0222270 3.0483765 0.0025563
topic_29 0.0219950 0.0204805 1.0739480 0.2839157
topic_30 0.0345708 0.0176023 1.9639920 0.0506757
topic_31 -0.0021999 0.0335033 -0.0656622 0.9477010
topic_32 0.0047848 0.1118821 0.0427665 0.9659229
topic_33 0.0281372 0.0189613 1.4839287 0.1391291
topic_34 0.0319834 0.0173598 1.8423853 0.0666421
topic_35 0.0279150 0.0171583 1.6269065 0.1050583
topic_36 0.0127701 0.0201107 0.6349897 0.5260350
topic_37 0.0346392 0.0207871 1.6663786 0.0969318
topic_38 0.0388209 0.0189901 2.0442700 0.0420094
topic_39 0.0365428 0.0265403 1.3768774 0.1698225
topic_40 0.0172476 0.0261905 0.6585459 0.5108135
topic_41 0.0377398 0.0203535 1.8542170 0.0649248
topic_42 0.0248161 0.0180613 1.3739901 0.1707160
topic_43 0.0331811 0.0189071 1.7549500 0.0805333
topic_44 0.0364368 0.0178258 2.0440551 0.0420308
topic_45 0.0994105 0.1427843 0.6962284 0.4869539
topic_46 0.0250350 0.0417817 0.5991848 0.5496102
topic_47 0.0684657 0.2844716 0.2406769 0.8100093
topic_48 0.0459670 0.0179390 2.5624016 0.0110015
topic_49 0.0281820 0.0176058 1.6007185 0.1107439
topic_50 0.0133341 0.0245449 0.5432550 0.5874543
topic_51 0.0768139 0.0237003 3.2410519 0.0013582
topic_52 0.0497146 0.0219592 2.2639504 0.0244631
topic_53 0.0307229 0.0228948 1.3419172 0.1808806
topic_54 0.0345447 0.0173587 1.9900518 0.0477105
topic_55 NA NA NA NA
cik4127 0.0050528 0.0125512 0.4025697 0.6876202
cik6281 0.0047687 0.0124193 0.3839731 0.7013355
cik723125 0.0074893 0.0122164 0.6130522 0.5404175
cik723531 0.0044749 0.0134251 0.3333207 0.7391808
cik743316 0.0028395 0.0122850 0.2311381 0.8174028
cik743988 0.0033442 0.0124527 0.2685524 0.7885029
cik769397 0.0181427 0.0121326 1.4953623 0.1361228
cik779152 0.0058436 0.0122335 0.4776735 0.6333138
cik796343 0.0007463 0.0121797 0.0612720 0.9511932
cik798354 0.0107927 0.0120591 0.8949898 0.3716818
cik804328 -0.0082341 0.0121683 -0.6766851 0.4992520
cik813672 0.0012690 0.0121904 0.1041005 0.9171758
cik827054 0.0060625 0.0123319 0.4916091 0.6234413
cik849399 0.0038317 0.0125188 0.3060772 0.7598090
cik877890 -0.0045308 0.0126916 -0.3569895 0.7214108
cik883241 0.0046864 0.0123963 0.3780467 0.7057273
cik896878 0.0062140 0.0123736 0.5022019 0.6159821
cik1013462 -0.0010103 0.0125133 -0.0807342 0.9357201
cik1045810 -0.0091426 0.0123840 -0.7382571 0.4610735
cik1101215 -0.0098667 0.0122508 -0.8053917 0.4213843
cik1108524 0.0092281 0.0121335 0.7605490 0.4476668
cik1123360 -0.0153089 0.0125179 -1.2229583 0.2225350
cik1136893 0.0044118 0.0124185 0.3552601 0.7227042
cik1141391 -0.0040708 0.0120817 -0.3369387 0.7364551
cik1175454 0.0057457 0.0126414 0.4545141 0.6498662
cik1341439 0.0153888 0.0126262 1.2187986 0.2241074
cik1365135 0.0063853 0.0124522 0.5127884 0.6085671
cik1383312 0.0302267 0.0128217 2.3574562 0.0191976
cik1403161 -0.0006274 0.0123547 -0.0507804 0.9595424
glance(lm_model) %>% kable() %>% 
  kable_styling(position = "center")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.276611 0.0285065 0.0265696 1.114897 0.2616751 83 768.7174 -1367.435 -1045.549 0.1708383 242 326

Results, Limitations and Conclusion

The management discussion section was successfully extracted from the downloaded 10-K reports for the 30 companies. The resulting corpus consists of 326 reports out of a possible 330, spanning the years 2010-2020. Reports were cleaned using rvest and regular expressions. Parsing errors were removed with the hunspell package together with frequency filtering and token-length filtering. Stopwords were removed using a number of dictionaries, including finance-specific ones. The text was then POS-tagged, and only nouns, adverbs and adjectives were kept for further analysis.
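
As a rough illustration of these cleaning steps, a minimal sketch is given below. The object item7_text (one row per report with columns cik, year and text), the udpipe model file name and the filtering thresholds are assumptions for illustration, not the original code.

library(dplyr)
library(tidytext)
library(hunspell)
library(udpipe)

# Assumed input: item7_text has one row per report (columns cik, year, text)
tokens <- item7_text %>%
  unnest_tokens(word, text) %>%                    # tokenise into lower-case words
  filter(hunspell_check(word),                     # drop parsing errors / non-words
         nchar(word) >= 3, nchar(word) <= 20) %>%  # token-length filtering
  anti_join(get_stopwords(), by = "word")          # generic stopword removal

# Frequency filtering: keep terms appearing at least 5 times in the corpus
keep_terms <- tokens %>% count(word) %>% filter(n >= 5) %>% pull(word)
tokens     <- tokens %>% filter(word %in% keep_terms)

# POS tagging with udpipe; keep only nouns, adjectives and adverbs
ud_model  <- udpipe_load_model("english-ewt-ud-2.5-191206.udpipe")  # assumed local model file
annotated <- as.data.frame(udpipe_annotate(
  ud_model, x = item7_text$text,
  doc_id = paste(item7_text$cik, item7_text$year, sep = "_")))
annotated <- annotated %>% filter(upos %in% c("NOUN", "ADJ", "ADV"))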

Event study methodology was used to calculate abnormal returns, abnormal volume and the variance of abnormal returns. Cumulative abnormal returns, average abnormal returns, cumulative abnormal volume and average abnormal volume over two-day and five-day windows were used as target variables, together with the variance of abnormal returns over the five-day window. To create a baseline model, fundamental and technical indicators were used. To test whether sentiment has an effect on prices and returns, multiple finance-specific and non-finance-specific dictionaries were evaluated, and the change in sentiment relative to the previous year's report was also tested as a predictor. In total 572 models were fitted.
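
For reference, the core abnormal-return calculation follows the market-model event-study approach of MacKinlay (1997). The sketch below is illustrative only: stock_ret and market_ret (aligned daily return series) and the window indices est_idx and evt_idx are assumed names, not objects from the original code.

# Estimate the market model over the estimation window
market_model <- lm(stock_ret[est_idx] ~ market_ret[est_idx])

# Abnormal returns over the event window: actual minus model-implied returns
abnormal_ret <- stock_ret[evt_idx] -
  (coef(market_model)[1] + coef(market_model)[2] * market_ret[evt_idx])

car    <- sum(abnormal_ret)   # cumulative abnormal return (2- or 5-day window)
aar    <- mean(abnormal_ret)  # average abnormal return
var_ar <- var(abnormal_ret)   # variance of abnormal returns (5-day window)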

Although sentiment scores appeared to be statistically significant on a number of occasions and the adjusted R-squared improved, the results were not consistent: p-values were generally large and the sentiment coefficients often switched signs. For most of the dependent variables the best-performing models were the baseline models built from fundamental and technical indicators, and the inconsistency became even more apparent after cross-validation. Among the sentiment-based models, the best performers on cross-validation were those that combined the sentimentr algorithm with the Loughran-McDonald (LM) dictionary. This is encouraging because it is consistent with theory: a finance-specific dictionary combined with an algorithm that accounts for negation should produce the most accurate scores. However, despite this cross-validation performance, the sentimentr-LM scores are not statistically significant predictors of abnormal returns, and even the “custom” dictionary fitted with returns as the response variable performs poorly on cross-validation. There is therefore little evidence that sentiment in the management section of 10-K reports influences prices, and even less that it can support decisions such as building trading strategies. Nonetheless, this study uses only 326 reports from 30 companies; including all major companies over a longer time period might yield different results. Furthermore, including other sections of the reports, such as Item 1 “Business” and Item 1A “Risk Factors”, may be a fruitful avenue for future research, since the additional text would allow the algorithms to capture a larger proportion of the sentiment expressed.
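
To make the sentimentr-LM combination concrete, a minimal sketch is given below. It assumes the sentimentr package is installed (it is not in the library list above) and that item7_text holds one row per report with a text column; the Loughran-McDonald polarity table is the one shipped with the lexicon package.

library(sentimentr)

# Sentence-level sentiment with valence shifters (negation, amplifiers),
# scored against the Loughran-McDonald finance dictionary
lm_scores <- sentiment_by(
  get_sentences(item7_text$text),
  polarity_dt = lexicon::hash_sentiment_loughran_mcdonald)

# Average sentiment per report, to be joined back onto the report-level data
# and added to the baseline regression as a predictor of abnormal returns
item7_text$sentimentr_lm <- lm_scores$ave_sentiment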

Topic modelling was conducted using both a supervised and an unsupervised stm approach. To select K, values from 4 to 60 were explored; based on semantic coherence the optimal K was chosen to be 9. Company fundamentals and industry features, such as the P/E ratio and the debt-to-equity ratio, were used as prevalence covariates. The unsupervised approach resulted in 55 topics. Unfortunately, neither approach yielded topics with a statistically significant relationship to cumulative abnormal returns: in the 55-topic regression above, for example, the adjusted R-squared is only about 0.03 and the overall F-test is not significant (p ≈ 0.26).
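
The sketch below outlines how the K search and the two stm fits described above could be set up. It assumes out is the output of stm::prepDocuments() (as in the regression code earlier) and that the prevalence covariates live in out$meta under illustrative names such as pe_ratio and debt_to_equity.

# Explore K = 4..60 and inspect semantic coherence (and exclusivity)
k_search <- searchK(out$documents, out$vocab, K = seq(4, 60, by = 2),
                    prevalence = ~ pe_ratio + debt_to_equity, data = out$meta)
plot(k_search)

# Supervised choice: K = 9 selected from the semantic coherence curve
stm_model <- stm(out$documents, out$vocab, K = 9,
                 prevalence = ~ pe_ratio + debt_to_equity, data = out$meta)

# Unsupervised route: K = 0 lets stm choose the number of topics itself,
# which is how a 55-topic model such as auto_stm_model could be obtained
auto_stm_model <- stm(out$documents, out$vocab, K = 0,
                      prevalence = ~ pe_ratio + debt_to_equity, data = out$meta)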


Bibliography

Feldman, R., Govindaraj, S., Livnat, J., & Segal, B. (2010). Management’s tone change, post earnings announcement drift and accruals. Review of Accounting Studies, 15(4), 915-953.

MacKinlay, A. C. (1997). Event studies in economics and finance. Journal of Economic Literature, 35(1), 13-39.

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).

Mourtgos, S. M., & Adams, I. T. (2019). The rhetoric of de-policing: Evaluating open-ended survey responses from police officers with machine learning-based structural topic modeling. Journal of Criminal Justice.

Stewart BM (2020). Comment on GitHub issue #212, “Non-Atomic Vectors as Metadata?” [Online]. Comment posted on 9 Feb 2020. Available from: https://github.com/bstewart/stm/issues/212

Roberts ME, Stewart BM, Tingley D, Lucas C, Leder-Luis J, Gadarian SK, Albertson B, Rand DG (2014). “Structural Topic Models for Open-Ended Survey Responses.” American Journal of Political Science, 58(4), 1064–1082. doi:10.1111/ajps.12103.

Yadav, P. K. (1992). Event studies based on volatility of returns and trading volume: A review. The British Accounting Review, 24(2), 157-184.

Yan, Q., 2020. Notes for “Text Mining with R: A Tidy Approach”. [ebook] Available at: https://bookdown.org/Maxine/tidy-text-mining/.

Package References

Hadley Wickham (2020). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.6. https://CRAN.R-project.org/package=rvest

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/

Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS, 1(3). doi: 10.21105/joss.00037 (URL: https://doi.org/10.21105/joss.00037).

Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr

Stefan Milton Bache and Hadley Wickham (2020). magrittr: A Forward-Pipe Operator for R. R package version 2.0.1. https://CRAN.R-project.org/package=magrittr

Hadley Wickham (2020). httr: Tools for Working with URLs and HTTP. R package version 1.4.2. https://CRAN.R-project.org/package=httr

Mario Annau (2015). tm.plugin.webmining: Retrieve Structured, Textual Data from Various Web Sources. R package version 1.3. https://CRAN.R-project.org/package=tm.plugin.webmining

Microsoft and Steve Weston (2020). foreach: Provides Foreach Looping Construct. R package version 1.5.1. https://CRAN.R-project.org/package=foreach

Microsoft Corporation and Steve Weston (2020). doParallel: Foreach Parallel Adaptor for the ‘parallel’ Package. R package version 1.0.16. https://CRAN.R-project.org/package=doParallel

Rinker, T. W. (2018). lexicon: Lexicon Data version 1.2.1. http://github.com/trinker/lexicon

Jeffrey A. Ryan and Joshua M. Ulrich (2020). quantmod: Quantitative Financial Modelling Framework. R package version 0.4.18. https://CRAN.R-project.org/package=quantmod

Jan Wijffels (2020). udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ‘UDPipe’ ‘NLP’ Toolkit. R package version 0.8.5. https://CRAN.R-project.org/package=udpipe

Wilson Freitas (2021). bizdays: Business Days Calculations and Utilities. R package version 1.0.8. https://CRAN.R-project.org/package=bizdays

Alboukadel Kassambara (2019). ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’. R package version 0.1.3. https://CRAN.R-project.org/package=ggcorrplot

Joshua Ulrich (2020). TTR: Technical Trading Rules. R package version 0.24.2. https://CRAN.R-project.org/package=TTR

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Roberts ME, Stewart BM, Tingley D (2019). “stm: An R Package for Structural Topic Models.” Journal of Statistical Software, 91(2), 1-40. doi: 10.18637/jss.v091.i02 (URL: https://doi.org/10.18637/jss.v091.i02)

Ian Fellows (2018). wordcloud: Word Clouds. R package version 2.6. https://CRAN.R-project.org/package=wordcloud

Nicolas Proellochs and Stefan Feuerriegel (2021). SentimentAnalysis: Dictionary-Based Sentiment Analysis. R package version 1.3-4. https://CRAN.R-project.org/package=SentimentAnalysis

Posthuma Partners (2019). lmvar: Linear Regression with Non-Constant Variances. R package version 1.5.2. https://CRAN.R-project.org/package=lmvar

Jeroen Ooms (2021). magick: Advanced Graphics and Image-Processing in R. R package version 2.7.2. https://CRAN.R-project.org/package=magick

Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes

Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra


Appendix