Module Title: Text Analytics

Module Code: “IB9CW0”

Year of Module: “2021”

Number of Pages: 131


Abstract

In this work, textual information from the Management Discussion section of 10-K reports is used to derive useful insights from a business standpoint. These reports are publicly available through the EDGAR platform on the SEC website. An NLP pipeline is then constructed to clean and process the text for further analysis.

In Part A a corpus is constructed by scraping the index of filings from the SEC website, downloading the reports and extracting the relevant section, Item 7. Next, important keywords are examined which can provide value from either an analytical or a business point of view. To this end, TF-IDF analysis is performed on unigrams, bigrams and trigrams.

In Part B the goal is to link the sentiment of the text to different financial indicators, namely abnormal returns, abnormal volume and the volatility of abnormal returns. First, prices, fundamental indicators and holiday dates are fetched to enable the calculation of abnormal returns, volume and variance. In addition to the fundamental indicators, a number of technical indicators are calculated to build a more robust baseline model. Panel data analysis is then performed.

In Part C topic modelling is performed in an attempt to identify which topics are commonplace in 10-K reports. Topic modelling is a useful technique because it can summarize large volumes of text almost instantly. For example, on the date of a report filing, an analyst could use this algorithm to summarize the report in a matter of seconds and link it to previous reports. Furthermore, topics can be linked to returns and other financial indicators.


Libraries

library(edgar)
library(rvest)
library(tidyverse)
library(lubridate)
library(tidytext)
library(stringr)
library(magrittr) # for Tee pipe
library(httr) # scrape with headers
library(htmltidy) # clean broken html
library(tm.plugin.webmining) # remove html method 2
library(foreach)
library(doParallel)
library(lexicon) # stopwords
library(quantmod)
library(udpipe) # text annotation 
library(lubridate) # date manipulation
library(bizdays) # business days manipulation
library(ggcorrplot) #plot correlation matrix
library(purrr) #map2
library(TTR) # TA indicators
library(splines) # for stm temporal
library(stm)
library(wordcloud) 
library(SentimentAnalysis)
library(lmvar) # cross validation
library(magick)
library(cowplot)
library(svglite)
library(ggthemes)
library(kableExtra)
library(sjPlot)

We assume that the default action is not to buy stocks based on the sentiment of 10-K reports, and the corresponding null hypothesis is that stock movements are not related to the sentiment of 10-K reports.

Part A

Portfolio Construction

Sector Selection

portfolio <- read_csv("portfolio-data.csv")
names(portfolio) <- stringr::str_replace_all(names(portfolio), " ", "_")
portfolio <- portfolio %>% rename(cik = CIK)

portfolio %>%
  mutate(GICS_Sub_Industry= as.factor(GICS_Sub_Industry)) %>% 
  group_by(GICS_Sub_Industry) %>% 
  count(sort = T) %>% 
  knitr::kable(caption = 'Portfolio option summary') %>% 
  kable_styling(position = "center")
Portfolio option summary
GICS_Sub_Industry n
Semiconductors 13
Data Processing & Outsourced Services 12
Application Software 11
Technology Hardware, Storage & Peripherals 7
IT Consulting & Other Services 6
Communications Equipment 5
Electronic Equipment & Instruments 3
Internet Services & Infrastructure 3
Semiconductor Equipment 3
Systems Software 3
Electronic Components 2
Electronic Manufacturing Services 2
Technology Distributors 1

The top 3 industries have over 10 companies each. A portfolio is built from companies in these industries. The main reason for this approach is sample maximization. Primarily, this study aims to uncover a link between price and text. Secondarily, the goal is to examine differences across industries. To achieve the latter goal, industries must be well represented in the sample. If differences in price dynamics are detected across these industries, further investigation can be made in industries that are less well represented. To summarize, the main motivation behind this portfolio choice is to maximize the statistical power of the study and the generalisability of the results.

industries <- c("Semiconductors", 
                "Data Processing & Outsourced Services", 
                "Application Software")

candidate_ciks <- portfolio %>% 
  filter(GICS_Sub_Industry %in% industries) %>% 
  pull(cik)

portfolio <- portfolio %>% 
  filter(cik %in% candidate_ciks)

rm(candidate_ciks, industries)

During the first attempt at extraction, the edgar package was used. For the top 10 companies within each category, 255 links were acquired from the master index through the edgar::getMasterIndex(2010:2020) call. In the best case scenario, only 70% of the records would be present. Furthermore, once the same package was used to extract the management discussion, a further 58 records were lost, leaving only 197 records for further analysis. This is 54% of the theoretical total. Even with sophisticated replacement strategies, such as extracting text from other sections or using 10-Qs instead of 10-Ks, this is inadequate. Therefore, custom scraping and section extraction algorithms are developed.

The results from scraping and parsing the daily index from the SEC website are significantly better than the ones provided by the edgar package. Overall we get 10,991,106 records vs 7,942,110 records from the package, with roughly 20,000 more 10-K reports. Indeed, for the selected industries all reports are available. A number of companies have fewer than 11 reports, but closer inspection shows that this is a result of name changes and structural changes of the company itself rather than a fault in the scraping procedure. Hence these companies are omitted as per the original strategy, leaving 10 companies for each selected industry with one report corresponding to each year.

Another advantage of focusing on just a few sectors is that it reduces the variation in word use, which hopefully leads to more consistent sentiment scores for firms in the portfolio. Words are likely to have different meanings across companies and sectors, which affects sentiment analysis negatively. By concentrating on a few industries this variation is minimized.

Get the Data

Fetch Master Index Files

Utility function to help combine URLs.

combURL <- function(base, addons, type="") {
  for (addon in addons) {
    base <- paste(base, addon, sep = "/")
  }
  return(paste0(base, type))
}

This operation cannot be parallelized or otherwise sped up as we need to stay under the SEC limit of 10 calls per second. Additionally, Sys.sleep(0.1) is used to slow down the script. Since parsing might take a long time, index files are saved at this stage and parsed later in parallel.

domain <- "https://www.sec.gov/Archives/edgar/daily-index"

if(!dir.exists("master_idx")){
  dir.create("master_idx")
}


for (year in 2010:2020) {
  for (i in 1:4) {
    qt <- paste0("QTR", i)
    url <- combURL(domain, c(year, qt, "index.json" ))
    Sys.sleep(0.1)
    GET(url, user_agent("Mozilla/5.0"), write_disk("temp.json"))
    file <- jsonlite::fromJSON("temp.json")
    for (link in file$directory$item$name) {
      if (str_detect(link, "master")) {
        url <- combURL(domain, c(year, qt, link))
        if(!dir.exists(combURL("master_idx", c(year)))) {
          dir.create(combURL("master_idx", c(year)))
        }
        if(!dir.exists(combURL("master_idx", c(year, qt)))) {
          dir.create(combURL("master_idx", c(year, qt)))
        }
        filename <- combURL("master_idx", c(year, qt, link))
        GET(url, user_agent("Mozilla/5.0"), write_disk(filename))
        Sys.sleep(0.1)
      }
    }
    file.remove("temp.json")
  }
}

Parse Master Index Files

cluster <- NULL

#utility function to register cluster.
register_cores <- function() {
  n_cores <- parallel::detectCores() - 12
  cluster <<- parallel::makeCluster(n_cores, type = "PSOCK")
  doParallel::registerDoParallel(cl = cluster)
  foreach::getDoParRegistered()
}

Parse idx files into a data frame for each year and quarter.

if(!dir.exists("master_indexes")){
  dir.create("master_indexes")
}

#Setup for parallel computing
register_cores()

#Use "foreach" loop from the foreach package which supports parrallel operations. 
for (year in 2010:2020) {
  year_master <- foreach(
    q = 1:4,
    .combine=rbind,
    .packages=c('tidyverse', 'stringr')
  ) %dopar% {
    qt <- paste0("QTR", q)
    url <- combURL("master_idx", c(year, qt))
    files <- list.files(url)
    q_master <- data.frame()
    for (file in files) {
      filename <- combURL("master_idx", c(year, qt, file))
      file <- readLines(filename)
      # split lines
      file <- str_split(file, '  ')
      # trim heading
      file <- file[8:length(file)]
      file_df <- data.frame()
      for (i in 1:length(file)) {
        # split into columns
        l <- str_split(file[[i]][1], '\\|')
        # convert to df
        df <- data.frame(cik=l[[1]][1], 
                         name=l[[1]][2], 
                         form_type=l[[1]][3], 
                         date=l[[1]][4], 
                         link=l[[1]][5], 
                         qtr = q)
        file_df <- rbind(file_df, df)
      }
      q_master <- rbind(q_master, file_df)
    }
    return(q_master)
  }
  # save result
  filename <- combURL("master_indexes", c(year), type = "_year_master.rda")
  save(year_master, file =  filename)
  rm(year_master)
  gc()
}

rm(q_master, df, file_df, l)

#close cluster
parallel::stopCluster(cl = cluster)

Combine all saved year master indexes into one data frame.

master_indexes <- list.files("master_indexes/",pattern="rda")
all_my_indexes <- data.frame()

for(master_index in master_indexes){
  load(paste0("master_indexes/",master_index))
  this_index <- year_master
  all_my_indexes <- bind_rows(all_my_indexes,this_index)
  print(master_index)
}
all_my_indexes <- all_my_indexes[-c(1:11),]

rm(this_index)

Download Files for Selected Industries

# update master index
all_my_indexes <- all_my_indexes %>% 
  filter(form_type == "10-K") %>%
  filter(cik %in% portfolio$cik)
domain <- "https://www.sec.gov/Archives/"

if(!dir.exists("full_text")) {
  dir.create("full_text")
}

for (i in 1:length(all_my_indexes$cik)) {
  row = all_my_indexes[i,]
  url <- paste0(domain, row$link)
  print(url)
  Sys.sleep(0.1)
  dirname <- paste0("full_text/", row$cik)
  dirname <- paste0(dirname, "/")
  print(dirname)
  if(!dir.exists(dirname)){
    dir.create(dirname)
  }
  filename <- paste0(paste0(dirname,row$date),".txt")
  print(filename)
  if(!file.exists(filename)) {
    print(filename)
    GET(url, user_agent("Mozilla/5.0"), write_disk(filename))
  }
}

rm(row, dirname, filename, url)

All 369 files downloaded successfully.

Regex Manual Extraction

# clean document titles
# clean item tags
cleanDocTitle <- function(text) {
  text <- str_replace(text, 'm&nbsp;', 'm')
  text <- str_replace(text, '>&nbsp;', ' ')
  text <- str_replace(text, '<[\\s\\S]*>', ' ')
  text <- str_replace_all(text, '\n', ' ')
  text <- str_replace_all(text, '"', ' ')
  text <- str_replace_all(text, '&#160;', ' ')
  text <- str_replace_all(text, '&nbsp;', ' ')
  text <- str_replace_all(text, ' ', '')
  text <- str_replace_all(text, '\\.', ' ')
  text <- str_replace_all(text, '>', ' ')
  text <- trimws(text)
  text <- tolower(text)
  return (text)
}

# remove html from text using rvest
strip_html <- function(text) {
  if (!is.na(text)) {
    if (text!= "") {
      tryCatch( {
        text <- html_text(read_html(text))
      }, error=function(cond) {
        text <- extractHTMLStrip(text)
      })
    }
  }
  return (text)
}
dirs <- list.dirs("full_text") 
master <- data.frame()

getSections <- function(regex, text) {
  start_end_section <- stringr::str_locate_all(text, regex)
  start_end_section <- as.data.frame(start_end_section)
  return(start_end_section)
}


for (dir in dirs[2:length(dirs)]) {
  files <- list.files(dir)
  for (file in files) {
    date <- file
    date <- str_remove(date, ".txt")
    cik <- str_remove(dir, "full_text/")
    file <- paste(dir, file, sep = "/" )
    text <-  read_file(file)
    
    doc_start <- as.data.frame(stringr::str_locate_all(text, "<DOCUMENT>"))
    doc_end <- as.data.frame(stringr::str_locate_all(text, "</DOCUMENT>"))
    type <- as.data.frame(stringr::str_locate_all(text, '<TYPE>[^\n]+'))
    
    for (i in 1:length(doc_start$start)) {
      doc <- substr(text, type$start[i],  type$end[i])
      if (str_detect(doc, "10-K")) {
        regex <- '(>|&nbsp;\\s|>&#160;|>&nbsp;)(Item|ITEM|Ite|It|"Item")(\\s|&#160;|&nbsp;|<a name="(1A|1B|7A|7|8|9|9A)">\\s|<.*?>m&nbsp;|\\s<.*?>)(1A|1B|7A|7|8|9|9A)\\.{0,1}'
        start_end_section <- getSections(regex, text=text)
        item7 <- NA
        temp_df <- NA
        tryCatch(
          {
            
            start_end_section$item  <- cleanDocTitle(substring(text, 
                                                               first=start_end_section$start, 
                                                               last= start_end_section$end))
            
            # select item 9 or 9a
            is_item_9_detected = (start_end_section %>% 
                                    filter(item == "item9") %>% 
                                    count() %>% pull(n)) > 0
            item9_lable <- ifelse(is_item_9_detected, "item9", "item9a")
            
            # select top item 9 or 9a start index
            top_item <- start_end_section %>% 
              filter(item == item9_lable) %>% 
              arrange(desc(start)) %>% 
              slice(1) %>% 
              pull(start)
            
            if (!is.na(top_item)) {
              # use top item 9 as upper bound
              start_end_section <- start_end_section %>% 
                filter(start < top_item) %>%  
                filter(!(item %in% c("item9", "item9a")))
            } 
            
            
            # top item from each item group
            start_end_section <- start_end_section %>% 
              group_by(item) %>% 
              arrange(desc(start)) %>% 
              slice(1) %>%  
              ungroup()
            
            # select item 8 or 7a
            is_item_8_detected = (start_end_section %>% 
                                    filter(item == "item8") %>% 
                                    count() %>% pull(n)) == 1
            end_lable <- ifelse(is_item_8_detected, "item8", "item7a")
            
            row_index_item8 <- start_end_section %>% 
              filter(item == end_lable) %>% 
              pull(start)
            
            # select item 7 or 7a
            is_item_7_detected = (start_end_section  %>% 
                                    filter(item == "item7") %>%
                                    count() %>% pull(n)) == 1
            
            start_lable <- ifelse(is_item_7_detected, "item7", "item7a")
            
            row_index_item7 <- start_end_section %>% 
              filter(item == start_lable) %>% 
              pull(start)
            
            # use item7a if item 7 is found after item 8. Preserves 3 reports.
            if (row_index_item7 > row_index_item8) {
              row_index_item7 <- start_end_section %>% 
                filter(item == "item7a") %>%
                pull(start)
            }
            
            
            item7 <- substr(text, start = row_index_item7, stop = row_index_item8)
            item7 <- strip_html(item7)
          },
          error=function(cond) {
            print(cik)
            print(date)
            message("Error message:")
            message(cond)
            return(NA)
          })
        if (!is.na(item7)) {
          if (item7 == "") {
            print(start_end_section)
            print(cik)
            print(date)
          }
        }
        temp_df <- data.frame(cik=cik, date=date, text=item7)
        master <- rbind(master, temp_df)
      }
    }
  }
}

rm(temp_df, file, item7, lable, row_index_item7, row_index_item8, start_end_section)

One report failed to parse. Investigation shows that there is actually no management discussion in that report; instead, there is a reference to another report.

master %>% 
  mutate(text_size = nchar(text)) %$% 
  hist(text_size, breaks = 20, xlim = c(0, 1000000))

The distribution of text size helps detect further errors. A small text size is a sign that problems might have occurred. From this sample it seems anything below 2,000 characters is unlikely to contain relevant text.

  • TEXAS INSTRUMENTS INC (CIK: 97476) report consists entirely of references to annual shareholder letters.
  • INTEL (CIK: 50863) is an outlier in their formatting practice. They have almost entirely forgone the traditional approach to 10-K styling.

Both of these companies are eliminated from the portfolio.

Select Companies

# filter by available reports
portfolio_ciks <- master %>% 
  mutate(text_size = nchar(text)) %>% 
  filter(!(cik %in% c(97476, 50863))) %>% 
  filter(!(is.na(text) | text ==  "")) %>% 
  filter(text_size > 2000) %>% 
  unique() %>% 
  mutate(cik = as.numeric(cik)) %>% 
  inner_join(portfolio) %>% 
  count(GICS_Sub_Industry, cik) %>% 
  group_by(GICS_Sub_Industry) %>% 
  arrange(desc(n), .by_group = TRUE) %>% 
  filter(n > 7) %>% 
  pull(cik)
## Joining, by = c("cik", "Symbol", "Security", "GICS_Sector", "GICS_Sub_Industry")
# update portfolio
portfolio <- portfolio %>% 
  filter(cik %in% portfolio_ciks)

# update parsed text df
master <- master %>% 
  unique() %>%
  filter(!(is.na(text) | text ==  "")) %>%
  mutate(text_size = nchar(text)) %>%
  filter(text_size > 2000) %>%
  mutate(date = as.Date(date, format =  "%Y%m%d")) %>% 
  mutate(cik = as.numeric(cik)) %>%
  filter(cik %in% portfolio_ciks) %>% 
  inner_join(portfolio) %>% 
  mutate(doc_id = row_number())
## Joining, by = c("cik", "Symbol", "Security", "GICS_Sector", "GICS_Sub_Industry")
rm(portfolio_ciks)
#saveRDS(master, "master.rds")

Final Portfolio

portfolio$GICS_Sector <- NULL
knitr::kable(portfolio, caption = "Portfolio of Selected Companies")
Portfolio of Selected Companies
Symbol Security GICS_Sub_Industry cik
ADBE Adobe Systems Inc Application Software 796343
AMD Advanced Micro Devices Inc Semiconductors 2488
ADS Alliance Data Systems Data Processing & Outsourced Services 1101215
ADI Analog Devices, Inc.  Semiconductors 6281
ANSS ANSYS Application Software 1013462
ADSK Autodesk Inc.  Application Software 769397
BR Broadridge Financial Solutions Data Processing & Outsourced Services 1383312
CDNS Cadence Design Systems Application Software 813672
CTXS Citrix Systems Application Software 877890
FIS Fidelity National Information Services Data Processing & Outsourced Services 1136893
FISV Fiserv Inc Data Processing & Outsourced Services 798354
FLT FleetCor Technologies Inc Data Processing & Outsourced Services 1175454
GPN Global Payments Inc.  Data Processing & Outsourced Services 1123360
INTU Intuit Inc.  Application Software 896878
JKHY Jack Henry & Associates Data Processing & Outsourced Services 779152
MA Mastercard Inc.  Data Processing & Outsourced Services 1141391
MXIM Maxim Integrated Products Inc Semiconductors 743316
MCHP Microchip Technology Semiconductors 827054
MU Micron Technology Semiconductors 723125
NLOK NortonLifeLock Application Software 849399
NVDA Nvidia Corporation Semiconductors 1045810
ORCL Oracle Corp.  Application Software 1341439
PAYX Paychex Inc.  Data Processing & Outsourced Services 723531
QCOM QUALCOMM Inc.  Semiconductors 804328
CRM Salesforce.com Application Software 1108524
SWKS Skyworks Solutions Semiconductors 4127
SNPS Synopsys Inc.  Application Software 883241
V Visa Inc.  Data Processing & Outsourced Services 1403161
WU Western Union Co Data Processing & Outsourced Services 1365135
XLNX Xilinx Semiconductors 743988

NLP pipeline

Text Normalisation

HTML was already stripped during parsing; now further cleaning needs to be done to remove digits and symbols. At this stage two cleaning functions are defined: one removes punctuation completely, the other attempts to remove extra punctuation, mainly table leftovers, while preserving the sentence structure. Whilst punctuation is not necessary in most cases, the sentimentr package used in Part B relies on punctuation to identify inflection and analyses sentiment at the sentence level. However, due to the presence of tables in this dataset, removing punctuation accurately is challenging. Hence, the data is kept in two formats.

clean_text_retain_puntuation <- function(text) {
  #we trim the text to remove section title.
  text <- stringr::str_sub(text,start =  94,end = -1)
  #unescape unicode
  text <- stringi::stri_unescape_unicode(text)
  #sets all chars to unicode
  text <- iconv(text, "ASCII",  sub = " ")
  #removes line breaks
  text <- stringr::str_replace_all(text, "\r?\n|\r|\t", " ")
  text <- stringr::str_replace_all(text, "[[digit:]]$", " ")
  #removes digits
  text <- tm::removeNumbers(text)
  #remove $ and % sign
  text <- stringr::str_replace_all(text, "\\$|\\$(\\.)", " ")
  text <- stringr::str_replace_all(text, "%", " ")
  text <- qdap::bracketX(text)
  text <- str_squish(text)
  #remove repeating characters
  text <- stringr::str_replace_all(text, '([?@])\\1+', " ") 
  text <- stringr::str_replace_all(text, '\\,\\.|\\.\\,', " ") 
  return (text)
}
text_with_punctuation <- parallel::mclapply(master$text, clean_text_retain_puntuation)
text_with_punctuation <- unlist(text_with_punctuation)
saveRDS(text_with_punctuation, "text_with_punctuation.rds")
clean_text <- function(text) {
  #we trim the text to remove section title.
  text <- stringr::str_sub(text,start =  94,end = -1)
  #unescape unicode
  text <- stringi::stri_unescape_unicode(text)
  #sets all chars to unicode
  text <- iconv(text, "ASCII",  sub = " ")
  #removes line breaks
  text <- stringr::str_replace_all(text, "\r?\n|\r|\t", " ")
  #removes digits
  text <- tm::removeNumbers(text)
  #remove non word breaking punctuation
  text <- tm::removePunctuation(text,
                                preserve_intra_word_contractions = T,
                                preserve_intra_word_dashes = T)
  return (str_squish(text))
}
cleaned <- parallel::mclapply(master$text, clean_text )
master$text <- unlist(cleaned)
rm(cleaned)
saveRDS(master, "master_cleaned.rds")

POS Tagging

Next, Part-of-Speech tagging is conducted. Importantly, this is done before stopword removal, as common stopwords provide important grammatical information which helps the tagger distinguish between nouns and verbs. In other words, POS tagging looks at the sequence as a whole.

langmodel <- udpipe::udpipe_download_model("english")
langmodel <- udpipe::udpipe_load_model(langmodel$file_model)
postagged_text <- udpipe_annotate(langmodel,
                                  master$text,
                                  parallel.cores = 15,
                                  trace = T)

postagged_text <- as.data.frame(postagged_text)
saveRDS(postagged_text, "postagged_text.rds")
postagged_text <- readRDS("postagged_text.rds")
master <- readRDS("master_cleaned.rds")

Stopword Removal

Stopwords are words that carry only grammatical meaning and provide little sentiment value on their own in a typical bag-of-words model. Therefore, they are removed to reduce the noise in the dataset and speed up computation.

In addition to standard stopword dictionaries such as the SMART lexicon, the NLTK dictionary and Fry's top 100 list, finance-specific dictionaries are used. Loughran and McDonald (LM) provide custom stopword lists for financial text on their website: https://sraf.nd.edu/textual-analysis/resources/#StopWords

These are used to filter out names of auditors who audit the reports, names of management, references to geographic locations as well as numbers.

#load downloaded dictionaries 
SW_Auditor <- data.frame(word = readLines("stopwords/StopWords_Auditor.txt"))
SW_Currencies <- read_delim("stopwords/StopWords_Currencies.txt", delim = "|")[1]
names(SW_Currencies) <- c("word")
SW_DatesNumbers <- data.frame(word = readLines("stopwords/StopWords_DatesandNumbers.txt"))
SW_Geographic <- data.frame(word = readLines("stopwords/StopWords_Geographic.txt"))
SW_Names <- data.frame(word = readLines("stopwords/StopWords_Names.txt"))

StopWords_LM <- rbind(SW_Auditor,SW_Currencies, SW_Names, SW_DatesNumbers, SW_Geographic)
#after POS tagging we take the lemma, meaning words will be in lower case
StopWords_LM <- StopWords_LM %>% mutate(word = tolower(str_replace_all(word,"[^[:graph:]]", " ")))

rm(SW_Auditor, SW_Currencies, SW_Names, SW_DatesNumbers, SW_Geographic)

#remove words unrelated to content
my_stopwords <-  data.frame(word = c("Table", "of", "Contents", "table", "contents"))
#remove company names
company_names <- portfolio %>% unnest_tokens(word, Security) %>% select(word)
my_stopwords <- rbind(my_stopwords, company_names)
rm(company_names)
#Load standard stopword dictionaries
stopwords_nltk<- as.data.frame(stopwords::data_stopwords_nltk$en) 
data(sw_fry_100)
stopwords_fry <- as.data.frame(sw_fry_100)
names(stopwords_fry) <- c("word")
names(my_stopwords) <- c("word")
names(stopwords_nltk) <- c("word")

Term Frequency Filtering

Rather than filtering by tf-idf, a cautious approach is exercised: all words which appear more than 5 times are kept. This deals with the vast majority of parsing mistakes. Method-specific tf-idf trimming is applied as needed later on.

# document term frequency filter
document_term_freq_filter <- function(tokens) {
  tokens <- tokens %>% 
    count(word) %>%
    filter(n > 5) %>% 
    inner_join(tokens)
  
  return(tokens)
}

Parsing Error Removal

#we use mistake detection to remove parsing errors
#we don't expect actual spelling mistakes in 10-K reports
#mistake detection is done after lemmatization and POS filtering
#this saves computing time as we have fewer tokens

#hunspell wrapper -- checks against US English first, then British English
hunspell_double_english <- function(word) {
  word <- unlist(hunspell::hunspell(word, dict = 'en_US'))
  if (is.character(word)) {
    word <- unlist(hunspell::hunspell(word, dict = 'en_GB'))
  }
  return (word)
}

Putting it all together

# function to monitor token count in pipe
count_tokens <- function(tokens, name = "") {
  tokens %>% 
    count(word) %>% 
    summarise(total = sum(n)) %$% 
    print(total)
  print(name)
  return (tokens)
}


tokens <- postagged_text %>% 
  filter(upos %in% c("NOUN","ADJ","ADV")) %>%
  select(lemma, doc_id) %>% 
  rename(word = lemma) %>% 
  mutate(word = tolower(word)) %>%
  count_tokens("Stage: Initial") %>% 
  anti_join(stop_words, by = "word") %>% 
  count_tokens("Stage 1: after SMART") %>% 
  anti_join(stopwords_nltk, by = "word") %>% 
  count_tokens("Stage 2: after NLTK") %>% 
  anti_join(my_stopwords, by = "word") %>% 
  count_tokens("Stage 3: after My Stopwords") %>% 
  anti_join(StopWords_LM, by = "word") %>% 
  count_tokens("Stage 4: after LM Stopwords") %>% 
  mutate(token_length=nchar(word)) %>% 
  arrange(token_length) %>% 
  filter(token_length > 3) %>% 
  filter(token_length < 17) %>%
  arrange(token_length) %>% 
  count_tokens(name="stage 5: Remove Token Length Filter") %>% 
  document_term_freq_filter() %>% 
  count_tokens(name="Stage 6: Term Frequency Filter")

lematized <- tokens %>% 
  group_by(doc_id) %>% 
  summarise(documents_pos_tagged = paste(word,collapse = " "))

# remove mistakes
mistakes <- parallel::mclapply(lematized$documents_pos_tagged, hunspell_double_english)
mistakes <- unique(mistakes)
mistakes <- data.frame(word = unlist(mistakes))

master <- master %>% 
  mutate(doc_id = paste0("doc",row_number()))

tokens <- tokens %>% 
  anti_join(mistakes) %>% 
  count_tokens(name="Stage 7: After Mistake Removal") %>% 
  inner_join(master %>% select(-text))

#saveRDS(mistakes, "mistakes.rds") #backup

#update lematized text
lematized <- tokens %>% 
  group_by(doc_id) %>% 
  summarise(documents_pos_tagged = paste(word,collapse = " ")) 

#add lematized text to main df
master <- master %>% 
  left_join(lematized)

rm(mistakes, lematized, langmodel, StopWords_LM, stopwords_nltk, stopwords_fry)

saveRDS(tokens, "tokens.rds")
saveRDS(master, "master_pos.rds")

TF-IDF Analysis

At this stage, tf-idf analysis is used as an exploratory tool to better understand which terms in 10-K reports are important in the different industries. Functions are defined which allow for dynamic tf-idf trimming and document term frequency filtering within groups. This enables exploration of keywords at different grouping levels with various levels of trimming. Three types of tokens are surveyed: unigrams, bigrams and trigrams.

The methodology used is as follows:

  • Examine words ranked by frequency
  • Examine words ranked by frequency with additional trimming
  • Examine words ranked by tf-idf
  • Examine words ranked by tf-idf with additional trimming
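
For reference, the weighting applied by tidytext's bind_tf_idf (which the helper functions below wrap) is the standard tf-idf form, with the grouping variable (industry or company) playing the role of the document:

\[ \text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\left( \frac{N}{n_t} \right) \]

where \( n_{t,d} \) is the count of term \( t \) in group \( d \), \( N \) is the number of groups and \( n_t \) is the number of groups in which \( t \) appears.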

#bind tf-idf by specified measure
bind_tf_idf_custom  <- function(tokens, by, within_group_freq_bound) {
  tokens <- tokens %>% 
    drop_na(.data[[by]]) %>% 
    drop_na(word) %>% 
    count(word, .data[[by]]) %>% 
    filter(n > within_group_freq_bound) %>% 
    bind_tf_idf(word, .data[[by]], n)  
}

#filter tokens by tfidf for specified quantiles
trim_by_tfidf <- function(tokens, quantiles) {
  if (!is.null(quantiles)) {
    quantiles <- tokens %$%
      quantile(tf_idf, probs = quantiles) %>% 
      tidy(quantiles, na.rm = F)
    tokens <- tokens %>% 
      filter(tf_idf > quantiles$x[1], tf_idf < quantiles$x[2]) 
  } 
  return(tokens)
}

#group and count based on measure specified
summarise_conditionally <- function(tokens, measure, group_by) {
  if (measure != "n") {
    tokens <- tokens %>% 
      group_by(.data[[group_by]]) %>% unique()
  } else {
    tokens <- tokens %>% 
      group_by(.data[[group_by]], word) %>%
      summarise(n= sum(n))
  }
  return(tokens)
}

#bind by category, group by same or other category
#filter by tf-idf or within group frequency
#plot top n tokens for group
filter_bind_plot <- function(df, 
                             tokens, 
                             id, 
                             bind_by, 
                             group_by,
                             measure, 
                             within_group_freq_bound = 0, 
                             quantiles = NULL, 
                             n = 10) {
  
  meta <- df %>% select(.data[[id]],.data[[group_by]], .data[[ bind_by]])  %>% unique(.) 
    tokens %>% 
      bind_tf_idf_custom(by=bind_by, within_group_freq_bound) %>% 
      trim_by_tfidf(quantiles) %>% 
      inner_join(meta) %>% 
      select(.data[[measure]], .data[[group_by]], word) %>% 
      summarise_conditionally(measure = measure, group_by = group_by) %>% 
      arrange(desc(.data[[measure]])) %>% 
      mutate(row_number = row_number()) %>% 
      filter(row_number %in% 1:15) %>% 
      facet_bar(y = word, x = .data[[measure]], by = .data[[group_by]], name = name)
  
}

#adapted from (Yan, 2020)
#utility function which reorders words based on give measure within groups
facet_bar <- function(df, y, x, by, nrow = 1, ncol = 3, scales = "free", name="") {
  mapping <- aes(y = reorder_within({{ y }}, {{ x }}, {{ by }}), 
                 x = {{ x }}, 
                 fill = {{ by }})
  
  facet <- facet_wrap(vars({{ by }}), 
                      nrow = nrow, 
                      ncol = ncol,
                      scales = scales) 
  
  ggplot(df, mapping = mapping) + 
    geom_col(show.legend = FALSE) + 
    scale_y_reordered() + 
    facet + 
    ylab("") + 
    theme_light()
} 

#dir to save images
if(!dir.exists("PartA")){
  dir.create("PartA")
}

Unigrams

GICS_Sub_Industry: Frequency

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry", 
                 measure = "n", n =15)
ggsave("PartA/figure1.png", width = 40, height = 15, units = "cm")

Top 15 terms per GICS_Sub_Industry ranked by frequency

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "cik", 
                 measure = "n", n =15)
ggsave("PartA/figure2.png", width = 40, height = 15, units = "cm")

Defining the document level at the company or the entire industry does not yield much variation in the most frequent terms. With little variation across industries, it appears that these tokens are used in any typical 10-K report. This makes sense, as companies are expected to discuss “revenue”, “cost” and “income”.

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry",
                 measure = "n", 
                 quantiles = c(0, 0.9999), n=15)
ggsave("PartA/figure3.png", width = 40, height = 15, units = "cm")

Trimming the bag of words model using tf-idf removes most of the overlapping frequent terms.

GICS_Sub_Industry: Tf-Idf

filter_bind_plot(master,  
                 tokens,
                 id="cik", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry", 
                 measure = "tf_idf", n=15)
ggsave("PartA/figure4.png", width = 40, height = 15, units = "cm")

Ranking words using tf-idf without trimming results in a selection of industry-specific terms. For example, looking at the semiconductor industry, terms such as “wafer”, “gigabit”, “chipset” and “foundry” are dominant. These all refer to the manufacturing or components of video cards.

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry", 
                 measure = "tf_idf", 
                 within_group_freq_bound = 100)
ggsave("PartA/figure5.png", width = 40, height = 15, units = "cm")

Applying a heavy within-group term frequency filter removes rare terms with high tf-idf, allowing more dominant terms to stand out. For example, the name “MasterCard” is successfully removed from the Data Processing industry.

Overall this final analysis presents a good illustration of important keywords across the 3 industries.

In the case of Application Software, “subscription” is the most dominant term.

filter_bind_plot(master,  
                 tokens,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry",
                 bind_by = "GICS_Sub_Industry", 
                 measure = "tf_idf", 
                 quantiles = c(0.01, 0.98))
ggsave("PartA/figure6.png", width = 40, height = 15, units = "cm")

Trimming using tf-idf rather than within-group frequency yields a messier set of terms. This is because infrequent terms can still persist amongst the different categories.

Bigrams

bigrams <- master %>% 
  unnest_tokens(word, documents_pos_tagged, token="ngrams", n=2) %>% 
  drop_na(word)

ggplot(head(bigrams %>% group_by(word) %>% count() %>% arrange(desc(n)),15), 
       aes(reorder(word,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Bigrams") + ylab("Frequency") +
  ggtitle("Most frequent bigrams")

ggsave("PartA/figure7.png", width = 40, height = 15, units = "cm")

filter_bind_plot(master,
           bigrams,
           id="doc_id", 
           group_by = "GICS_Sub_Industry", 
           bind_by = "GICS_Sub_Industry",
           measure = "n", quantiles = c(0.25, 0.98))
ggsave("PartA/figure8.png", width = 40, height = 15, units = "cm")

filter_bind_plot(master,
           bigrams,
           id="doc_id", 
           group_by = "GICS_Sub_Industry", 
           bind_by = "GICS_Sub_Industry",
           measure = "tf_idf", within_group_freq_bound = 50)
ggsave("PartA/figure9.png", width = 40, height = 15, units = "cm")

Analysis of bigrams is not very informative. Nonetheless, it does support some of the findings from the unigram analysis. For example, in semiconductor industry reports many parts of video cards are mentioned, whilst in Application Software subscriptions are commonly discussed.

Trigrams

For trigrams, the raw text is used in an attempt to get more meaningful results.

trigrams <- master %>% 
  unnest_tokens(word, text, token="ngrams", n=3) %>% 
  drop_na(word) 

ggplot(head(trigrams %>%
  group_by(word) %>% 
  count() %>% 
  arrange(desc(n)),15), aes(reorder(word,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Tigrams") + ylab("Frequency") +
  ggtitle("Most frequent tigrams")

ggsave("PartA/figure10.png", width = 40, height = 15, units = "cm")

filter_bind_plot(master,
                 trigrams,
                 id="doc_id", 
                 group_by = "GICS_Sub_Industry", 
                 bind_by = "GICS_Sub_Industry",
                 measure = "tf_idf")

“Air miles reward” suggests that credit card rewards are an important topic in the Data Processing & Outsourced Services sub-industry, which is not surprising given that Visa and Mastercard are part of this sector. Software license updates, hardware system support as well as subscriptions are dominant tokens in the Application Software sector. In the semiconductor industry the focus is on shipping and manufacturing. This insight will be useful when labelling topics in the topic modelling phase.


Part B

An event study is performed at this stage. Event studies have a long history going back to 1933. The main idea behind event studies is to compare the return of a stock during some window after an event to a baseline estimated from past data. This difference is known as the abnormal return, which can be attributed to the event. We largely follow the methodology of MacKinlay (1997).

Formally abnormal return is defined as:

\[ AR_{it} = R_{it} - E(R_{it}|X_t) \]

Abnormal return is the actual return minus the normal (expected) return for time window \( t \), where \( X_t \) represents the conditioning information on which the normal return is modeled.

The normal return needs to be modeled before the analysis begins. A simple approach would be to take the return on the stock a week before. However, this is not a sound approach, since anticipation of the filing already affects market prices. In practice two models are often used: the constant mean return model and the market model. The first assumes that a given security has a constant mean return across time. The second assumes that there is a linear relationship between the return on the security and the market (MacKinlay, 1997). The market model is an improvement over the constant mean model because it helps reduce the variance associated with market movements. There are more complicated models, such as the Fama-French 3-factor and 5-factor models, which help reduce the variance associated with different firm types. However, in our case we assume that there are indeed abnormal returns, at least in some cases, and we seek to understand whether these can be attributed to the sentiment of the management section of the report. So instead of including fundamental indicator information in the market model during the abnormal return estimation phase, it is used as a control during the regression-on-sentiment phase.
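
Concretely, under the market model the normal return is obtained by regressing the firm's return on the market return over the estimation window, and the fitted values serve as the expected return:

\[ R_{it} = \alpha_i + \beta_i R_{mt} + \varepsilon_{it}, \qquad E(R_{it}|X_t) = \hat{\alpha}_i + \hat{\beta}_i R_{mt} \]

In the implementation below, the market return \( R_{mt} \) is proxied by the average return of the other portfolio companies (excluding the target firm) rather than by a broad market index.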

After a model is chosen and the event window defined, abnormal and cumulative abnormal returns on a stock after filing are calculated by subtracting benchmark/normal returns from actual returns. Following MacKinlay (1997), the estimation window for normal returns is defined as the 250 trading days before the event window, which is roughly equivalent to one calendar year. The event window is defined as two weeks before filing to one week after filing. The two-week gap serves to ensure independence of normal returns from the event-driven returns. Abnormal returns are only calculated for the period after the filing.
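
Abnormal returns over the event window are then aggregated by summation; because log returns are used, this sum is also the (log) abnormal return over the whole window:

\[ CAR_i(t_1, t_2) = \sum_{t=t_1}^{t_2} AR_{it} \]

This corresponds to the cumulative abnormal return variables computed further below.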

Calendar non-trading day adjustments

Before prices are fetched to calculate returns, dates need to be formatted and adjusted. This is done because the stock market is closed on weekends and holidays. This means that if we simply take n days before or after filing to calculate some financial indicator, we will get NA values if this date falls on a holiday. To prevent this we need to offset days based on the business day calendar. Hence we declare holidays using the bizdays package and offset using its API. This approach is in line with how returns are normally calculated in the financial sector.

Holidays on which the US stock exchange is closed in 2021:

  • New Year’s Day: Friday, Jan. 1
  • Martin Luther King Jr. Day: Monday, Jan. 18
  • Washington’s Birthday/Presidents Day: Monday, Feb. 15
  • Good Friday: Friday, April 2
  • Memorial Day: Monday, May 31
  • Independence Day: Monday, July 5 (observed, because July 4 falls on a Sunday)
  • Labor Day: Monday, Sept. 6
  • Thanksgiving: Thursday, Nov. 25
  • Christmas: Friday, Dec. 24 (observed, because Christmas Day falls on a Saturday)

Note that many holidays are shifted when they fall on weekends, whilst others depend on the week number. Other holidays, like "Inauguration Day", occur every 4 years. This means that formulating a rules-based approach is tedious and error-prone. Instead, a publicly available API is used to get the dates of all US federal holidays. However, Good Friday is not a national holiday but a state one; therefore dates for this holiday are scraped from another website. A number of dates are added manually: the stock market closed during Hurricane Sandy and to commemorate George Bush's death.

domain <- "https://date.nager.at/api/v3/PublicHolidays"
years <- 2008:2021

holidays <- c()
for (year in years) {
  url <- url(combURL(domain, c(year, "US")))
  json <- jsonlite::stream_in(url)
  holidays <- c(holidays, json$date)
  on.exit(close(url))
}


#get page with religious public holidays
public_holidays <- read_html("http://www.maa.clell.de/StarDate/publ_holidays.html")

tbl1 <- public_holidays %>% html_nodes("table") %>% 
  .[[6]] %>% 
  html_table() %>% 
  as.data.frame() %>% 
  filter(X1 %in% years) %>% 
  mutate(X3 = str_replace_all(X3, "\\.", "-")) %>% 
  mutate(date = ymd(ydm(paste(X1, X3, sep="-")))) %>% 
  pull(date)

tbl2 <- public_holidays %>% html_nodes("table") %>% 
  .[[7]] %>% 
  html_table() %>% 
  as.data.frame() %>% 
  filter(X1 %in% years) %>% 
  mutate(X3 = str_replace_all(X3, "\\.", "-")) %>% 
  mutate(date = ymd(ydm(paste(X1, X3, sep="-")))) %>% 
  pull(date)

good_friday_dates <- c(tbl1, tbl2)

# add dates manually
dates <- c("2018-12-05", # george bushes death 
           "2012-10-29", # Hurricane Sandy
           "2012-10-30") # Hurricane Sandy

holidays <- c(holidays, as.character(good_friday_dates), dates)


# declare holidays and weekends
create.calendar(name="mycal", 
                weekdays=c('saturday', 'sunday'),
                holidays=holidays)


rm(good_friday_dates, dates, tbl2, tbl1, year, json, public_holidays, url, domain, years)
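
As a quick sanity check of the calendar, the illustrative calls below (not part of the main pipeline) show how bizdays skips both weekends and the declared holidays; the exact results assume that New Year's Day 2021 is present in the fetched holiday list.

# illustrative checks of the "mycal" trading calendar declared above
bizdays::is.bizday(as.Date("2021-01-01"), "mycal") # FALSE: declared holiday
bizdays::is.bizday(as.Date("2021-01-02"), "mycal") # FALSE: Saturday
bizdays::offset(as.Date("2020-12-31"), 1, "mycal") # next trading day, skipping the holiday and weekend
bizdays::bizdays(as.Date("2021-01-01"), as.Date("2021-01-31"), "mycal") # trading days in January 2021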

Fundamental Indicators

master <- master %>% 
  mutate(year = format(date, '%Y'))
master$year <- as.numeric(master$year)
domain <- "https://www.macrotrends.net/stocks/charts"
endpoints <- c("pe-ratio", "shares-outstanding", "eps-earnings-per-share-diluted", 
               "debt-equity-ratio", "roe", "roi", "roa")

colums_needed <- c("PE Ratio", "Debt to Equity Ratio", "Return on Equity", 
                   "Return on Investment", "Return on Assets")


fundamental_indicators <- data.frame()

for (i in 1:nrow(portfolio)) {
  company_fundamentals <- data.frame()
  company_name <- portfolio [i,]$Security
  company_symbol <- portfolio [i,]$Symbol
  company_name <- str_replace_all(company_name, ' ', "-")
  for (endopoint in endpoints) {
    url <- combURL(domain, c(company_symbol, company_name, endopoint))
    print(url)
    html <- read_html(url)
    tbl <- html %>% html_nodes("table") %>% .[[1]] %>% html_table() %>% as.data.frame() 
    if (!(endopoint %in% c("shares-outstanding", "eps-earnings-per-share-diluted"))) {
      names(tbl) <- as.matrix(tbl[1, ])
      tbl <- tbl[-1, ]
      tbl[] <- lapply(tbl, function(tbl) type.convert(as.character(tbl)))
    }
    temp_df <- data.frame(date=as.character(tbl[,1]), 
                          value=tbl[,names(tbl) %in% colums_needed])
    if (ncol(temp_df) == 1) {
      temp_df <- data.frame(date=parse_number(as.character(tbl[,1])), 
                            value=parse_number(as.character(tbl[,2])))
    } else {
      temp_df <- data.frame(date=parse_number(as.character(temp_df[,1])), 
                            value=parse_number(as.character(temp_df[,2])))
    }
    names(temp_df) <- c("date", endopoint)
    if (!(endopoint %in% c("shares-outstanding", "eps-earnings-per-share-diluted"))) {
    temp_df <- temp_df %>% 
      mutate_if(is.character, ~ year(ymd(date))) 
    }
    temp_df <- temp_df %>%  filter(date %in% 2008:2020) 
    temp_df <- aggregate(temp_df[,2], list(temp_df$date), mean)
    
    names(temp_df) <- c("year", endopoint)
    if (length(company_fundamentals) == 0) {
      company_fundamentals <- temp_df
    } else {
      company_fundamentals <- temp_df %>% inner_join(company_fundamentals)
    }
    company_fundamentals$Symbol <- company_symbol
    
  }
  fundamental_indicators <- rbind(fundamental_indicators, company_fundamentals)
}  

names(fundamental_indicators) <- str_replace_all(names(fundamental_indicators), "-", "_")

rm(company_fundamentals, company_name, company_symbol, endopoint, 
   endpoints, domain, temp_df, url, colums_needed, tbl, html)

colSums(!is.na(fundamental_indicators[,2:ncol(fundamental_indicators )]))

#roi column has lots of missing values unlike all other columns
#given the small size of our data set this indicator is dropped
fundamental_indicators <- fundamental_indicators %>% 
  select(-roi)

saveRDS(fundamental_indicators, "fundamental_indicators.rds")

Indicator EDA

master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=roa)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4)

The semiconductor industry has a higher ROA, indicating higher profitability per unit of assets deployed.

master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=roe)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4) +  
  xlim(-50, 100)
## Warning: Removed 22 rows containing non-finite values (stat_density).

master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=pe_ratio)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4) + 
  xlim(0, 100)
## Warning: Removed 18 rows containing non-finite values (stat_density).

Application Software has a more spread out distribution of PE ratios. The semiconductor industry is the most conservative in terms of price to earnings.


master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=debt_equity_ratio)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4) + 
  xlim(0, 15)
## Warning: Removed 17 rows containing non-finite values (stat_density).

The semiconductor industry has the lowest levels of leverage relative to equity. Data Processing and Outsourced Services has the highest amount of debt.

master %>% 
  inner_join(fundamental_indicators, by = c("Symbol", "year")) %>% 
  ggplot(., aes(x=eps_earnings_per_share_diluted)) + 
  geom_density(aes(group=GICS_Sub_Industry, fill=GICS_Sub_Industry), alpha = 0.4)

Earnings per share appear to be largely the same across the industries examined.

Financial Indicator Fetching and Calculation

Log returns are used in this analysis. Using log prices has a number of advantages, including ease of arithmetic manipulation. In most cases, with the exception of some technical indicators, the adjusted closing price is used since it takes into account corporate actions such as stock splits.

Note that some filings take place even when the stock market is closed, namely during Hurricane Sandy. The original date must therefore also be offset in this rare case, which makes the script more future-proof.

  EXAMPLE <- master[1,]

  comp <- getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"), auto.assign=FALSE)
  
  daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")
  
  chartSeries(comp,
              subset=daterange,
              theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")
EXAMPLE <- master[1,]

getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"))

daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")

chartSeries(get(EXAMPLE$Symbol),
            subset=daterange,
            theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')") 
EXAMPLE <- master[2,]

getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"))

daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")

no_axis <- x <- chartSeries(ANSS,
            subset=daterange,
            theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")
EXAMPLE <- master[13,]

getSymbols(EXAMPLE$Symbol, from=bizdays::offset(EXAMPLE$date, -30, "mycal"),to=bizdays::offset(EXAMPLE$date, 30, "mycal"))

daterange <- paste(bizdays::offset(EXAMPLE$date, -30, "mycal"), bizdays::offset(EXAMPLE$date, 30, "mycal"), sep = "::")

chartSeries(ANSS,
            subset=daterange,
            theme=chartTheme('white'), TA = "addTA(xts(TRUE,EXAMPLE$date),on=-1, col='blue')")

Note, prices for 2009 are also extracted as reports filed in 2010 have an estimation period outside the bounds of the specified date range.

tickers <- master %>% pull(Symbol) %>% unique()
#list to store all prices
prices <- lapply(tickers, getSymbols, auto.assign=FALSE, from='2009-01-01',to='2020-12-31')
names(prices) <- tickers

Returns

Daily log returns are calculated. Log returns can be summed to get the cumulative return for a given period.
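
For reference, the daily log return produced by dailyReturn(type = "log") is

\[ r_t = \ln\left(\frac{P_t}{P_{t-1}}\right) = \ln P_t - \ln P_{t-1}, \qquad \sum_{t=1}^{k} r_t = \ln\left(\frac{P_k}{P_0}\right) \]

so summing daily log returns over a window yields the log return for the whole window, which is why cumulative abnormal returns can be computed as simple sums.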

my_return <- function(x) {
    y <- dailyReturn(x, type="log")
    names(y) <- strsplit(names(x)[1], "\\.")[[1]][1]
    y
}

returns <- lapply(prices, my_return)
names(returns) <- tickers

Volume

Volume is the total number of shares of a security traded in a given time period. However, companies have different numbers of outstanding shares, so the daily volume data first needs to be normalized to enable comparison between companies. Furthermore, similar to abnormal returns, the log of volume per share is used. This is done because volume is not normally distributed, breaking statistical and financial assumptions (Yadav, 1992). The log transform solves this issue, but adding a small constant is required to prevent NA values at zero volume (Yadav, 1992). Additional information about the formula used can be found at: https://www.eventstudytools.com/volume-event-study
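
In formula form, the normalisation implemented in the dailyVolume() helper below is

\[ v_{it} = \ln\left( \frac{V_{it} + c}{S_{i,y}} \times 1000 \right), \qquad c = 0.00025, \]

where \( V_{it} \) is the raw daily volume, \( S_{i,y} \) is the shares outstanding of firm \( i \) in year \( y \), and \( c \) is a small constant that prevents taking the log of zero volume.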

dailyVolume <- function(v, y, df) {
  shares <- df %>% filter(year == y) %>%  pull(shares_outstanding)
  vol <- log(((v + 0.00025)/shares*1000))
  vol
}

my_volume <- function(i) {
    x <- prices[[i]]
    name <- names(prices)[i]
    df <- fundamental_indicators %>% filter(Symbol == name)
    v=Vo(x)
    v$vol <- mapply(dailyVolume, v=v, year(index(x)), list(df))
    return(v[,2])
}

volume <- lapply(seq_along(prices), my_volume)
names(volume) <- tickers

Calculating

# utility function to help subset list of returns calculated earlier
subset_comp <- function(x, start_date, end_date) {
  x <- x[as.character(paste(start_date, end_date, sep = "/"))]
  as.data.frame(x)
}

# get avg market return for specified date range
# exclude target symbol
get_market_return <- function(symbol, estimation_start_date, estimation_end_date) {
  avg_market_daily_returns <- mapply(FUN = subset_comp, 
                                     x = returns[tickers[tickers != symbol]], 
                                     start_date = estimation_start_date, 
                                     end_date = estimation_end_date)
  avg_market_daily_returns  <- as.data.frame(
    matrix(unlist(avg_market_daily_returns), 
           nrow=length(unlist(avg_market_daily_returns[1]))))
  avg_market_daily_returns  <- rowMeans(avg_market_daily_returns, na.rm=T)
  return(avg_market_daily_returns)
}


# get avg market volume for specified date range
# exclude target symbol
get_market_volume <- function(symbol, estimation_start_date, estimation_end_date) {
  avg_market_daily_volume <- mapply(FUN = subset_comp, 
                                    x = volume[tickers[tickers != symbol]], 
                                    start_date = estimation_start_date, 
                                    end_date = estimation_end_date)
  avg_market_daily_volume  <- as.data.frame(
    matrix(unlist(avg_market_daily_volume), 
           nrow=length(unlist(avg_market_daily_volume[1]))))
  avg_market_daily_volume  <- rowMeans(avg_market_daily_volume)
  print(avg_market_daily_volume)
  return(avg_market_daily_volume)
}

# fit market model
get_market_model <- function(avg_market_daily_returns_in_est, comp_returns_in_est) {
  print(avg_market_daily_returns_in_est)
  print(comp_returns_in_est)
  x <- cbind(comp_returns_in_est, avg_market_daily_returns_in_est)
  x <- as.data.frame(x)
  names(x) <- c("company_returns", "market_return")
  market_model <- lm(company_returns ~ market_return, data=x)
  return(market_model) 
}

# predict using market model
calculate_normal_returns <- function(market_model, comp_returns, avg_market_daily_returns) {
  x <- cbind(comp_returns, avg_market_daily_returns)
  x <- as.data.frame(x)
  names(x) <- c("company_returns", "market_return")
  normal_returns <- predict(market_model, x)
}

#
# Main function to be used in a mapping operation.
#


returns_calc <- function(symbol, date) {
  #offset required dates
  date <- if_else(date %in% as.Date(holidays), bizdays::offset(date, 1, "mycal"), date)
  day_before <- bizdays::offset(date, -1, "mycal")
  next_day <- bizdays::offset(date, 1, "mycal")
  next_week <- bizdays::offset(day_before, 5, "mycal")
  next_month <- bizdays::offset(day_before, 21, "mycal")
  next_year <- bizdays::offset(day_before, 250, "mycal")
  prior_2_weeks <- bizdays::offset(day_before, -14, "mycal")
  estimation_start_date <- bizdays::offset(prior_2_weeks, -250, "mycal")
  
  # VOLUME
  
  # next day
  
  # market volume during estimation period
  avg_market_daily_volume <- get_market_volume(symbol, estimation_start_date, prior_2_weeks)
  # target company volume during estimation period
  interval <- as.character(paste(estimation_start_date, prior_2_weeks, sep = "/"))
  comp_volume_in_est <-  as.numeric(volume[[symbol]][interval, ])
  # market model
  market_model <- get_market_model(avg_market_daily_volume, comp_volume_in_est)
  
  # calculating normal volume after filing
  avg_market_daily_volume <- get_market_volume(symbol, date, next_day)
  interval <- as.character(paste(date, next_day, sep = "/"))
  comp_volume <-  as.numeric(volume[[symbol]][interval, ])
  normal_volume <- calculate_normal_returns(market_model, comp_volume, avg_market_daily_volume)
  
  # calculating abnormal volume after filing
  abnormal_volume <- comp_volume - normal_volume
  avg_abnormal_volume_next_day <- mean(abnormal_volume)
  cum_abnormal_volume_next_day <- sum(abnormal_volume)
  volume_direction_next_day <- ifelse(cum_abnormal_volume_next_day > 0, 1, 0)
  print("cum abnormal vol")
  print(cum_abnormal_volume_next_day)
  
  # RETURNS
  
  # market returns during estimation period
  avg_market_daily_returns_in_est <- get_market_return(symbol, estimation_start_date, prior_2_weeks)
  # target company return during estimation period
  interval <- as.character(paste(estimation_start_date, prior_2_weeks, sep = "/"))
  comp_returns_in_est <-  as.numeric(returns[[symbol]][interval, ])
  # market model
  market_model <- get_market_model(avg_market_daily_returns_in_est, comp_returns_in_est)
  
  #AFTER FILING
  
  # next day
  
  # calculating normal returns after filing
  avg_market_daily_returns <- get_market_return(symbol, date, next_day)
  interval <- as.character(paste(date, next_day, sep = "/"))
  comp_returns <-  as.numeric(returns[[symbol]][interval, ])
  normal_returns <- calculate_normal_returns(market_model, comp_returns, avg_market_daily_returns)
  
  # calculating abnormal returns after filing
  abnormal_returns <- comp_returns - normal_returns
  avg_abnormal_return_next_day <- mean(abnormal_returns)
  cum_abnormal_return_next_day <- sum(abnormal_returns)
  direction_next_day <- ifelse(cum_abnormal_return_next_day > 0, 1, 0)
  
  # week
  
  # calculating normal returns after filing
  avg_market_daily_returns <- get_market_return(symbol, date, next_week)
  interval <- as.character(paste(date, next_week, sep = "/"))
  comp_returns <-  as.numeric(returns[[symbol]][interval, ])
  normal_returns <- calculate_normal_returns(market_model, comp_returns, avg_market_daily_returns)
  
  # calculating abnormal returns after filing
  abnormal_returns <- comp_returns - normal_returns
  avg_abnormal_return <- mean(abnormal_returns)
  cum_abnormal_return <- sum(abnormal_returns)
  variance_next_week <- var(abnormal_returns)


  #BEFORE FILLING
  
  before <- prices[[symbol]][as.character(paste(prior_2_weeks, day_before, sep = "/")), ]
  # TA indicators
  # roc - momentum indicator
  roc <- as.numeric(ROC(Ad(before),n = 7)[day_before])
  # sma - simple moving average
  ma7 <- as.numeric(SMA(Ad(before), 7)[day_before]) 
   # rsi - momentum indicator - strength of current movement direction
  rsi <- as.numeric(RSI(Ad(before), 7)[day_before]) 
  # obv - on-balance volume, a measure of the money flowing into or out of a security
  obv <- as.numeric(OBV(Ad(before), Vo(before))[day_before])
  df <- cbind(cum_abnormal_return, avg_abnormal_return, 
              avg_abnormal_return_next_day, cum_abnormal_return_next_day,
              avg_abnormal_volume_next_day,cum_abnormal_volume_next_day, volume_direction_next_day,
              roc, ma7, rsi, obv, variance_next_week)
  names(df) <- c("cum_abnormal_return_next_week", "avg_abnormal_return_next_week", 
                 "avg_abnormal_return_next_day", "cum_abnormal_return_next_day",
                 "avg_abnormal_volume_next_day","cum_abnormal_volume_next_day", "volume_direction_next_day",
                  "roc", "ma7", "rsi", "obv", "variance_next_week")
  df
}


indicators <- master  %>% select(doc_id, Symbol, date)
x <-  mcmapply(returns_calc, symbol=indicators$Symbol, date=indicators$date)
indicators <- data.frame(cbind(as.data.frame(t(x)), indicators), row.names = 1:nrow(indicators))

rm(prices, returns, volume, tickers, end.time, start.time, 
   my_return, my_volume, returns_calc, subset_comp, i)

saveRDS(indicators, "indicators.rds")

Indicator EDA

Some of the chosen indicators are highly correlated. Therefore, one or two need to be removed to prevent multicollinearity problems.

corr <- round(cor(fundamental_indicators %>% 
                    select(roa, roe, debt_equity_ratio, 
                           eps_earnings_per_share_diluted, pe_ratio), 
                  method="spearman"), 1)
ggcorrplot(corr)

ROE and ROA are correlated; one of them should be removed before regression.

hist(indicators$avg_abnormal_return_next_day, breaks = 40)

hist(indicators$cum_abnormal_return_next_day, breaks = 40)

Examining the distribution of average abnormal returns shows that in most cases there are no abnormal returns. Therefore, there is an opportunity to create a new feature, movement direction. Loss is defined as everything below the 25th percentile of abnormal returns, whilst gain is defined as everything above the 75th percentile. Everything in between is defined as stay. This feature engineering should reduce the noise in our target variable.

indicators <- indicators %>% 
  mutate(movment_direction = 
           ifelse(cum_abnormal_return_next_day < quantile(indicators$cum_abnormal_return_next_day)[2], 
                  "loss", "stay")) %>% 
  mutate(movment_direction = 
           ifelse(cum_abnormal_return_next_day > quantile(indicators$cum_abnormal_return_next_day)[4],
                  "gain", movment_direction)) %>% 
  mutate(movment_direction = factor(movment_direction))

Sentiment Calculation

Tidytext is used to get the AFINN, NRC, and Bing dictionaries. Alternatively, the syuzhet package could be used; however, the tidytext dictionaries provide more flexibility in terms of token manipulation and sentiment calculation. The SentimentAnalysis package is used to compute polarity with the Loughran and McDonald, Harvard General Inquirer and Henry dictionaries, as well as the LM uncertainty ratio. The Loughran and McDonald and Henry dictionaries are finance-specific, so they are expected to produce more accurate scores. Furthermore, the SentimentAnalysis package is used to fit a custom sentiment dictionary by setting cumulative returns as the response variable. Finally, the sentimentr package is used to score sentiment whilst taking negation and amplification into account, using both its default dictionary and a custom one based on the Loughran and McDonald dictionary from tidytext. The sentimentr function is a faster and more accurate alternative to the qdap polarity function created by the same author.

#nrc
#grab sum of emotional words to normalize emotions in next step
total <- tokens %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(!(sentiment %in% c("positive", "negative"))) %>% 
  group_by(doc_id) %>% 
  count(doc_id, sentiment) %>% 
  summarize(total = sum(n)) %>% 
  pull(total)

#polarity + emotions
sent <- tokens %>%
  inner_join(get_sentiments("nrc")) %>% 
  count(doc_id,sentiment) %>%
  pivot_wider(names_from=sentiment, values_from=n) %>%
  mutate(sentiment_nrc = (positive - negative)/(positive+negative)) %>%
  select(-negative, -positive) %>% 
  mutate(across(c(2:9), .fns = ~./total)) 

#bing
sent <- tokens %>% 
  inner_join(get_sentiments("bing")) %>%
  count(doc_id,sentiment) %>%
  pivot_wider(names_from = sentiment,values_from = n) %>%
  mutate(sentiment_bing = (positive-negative)/(positive+negative)) %>%
  select(-negative, -positive)  %>% 
  inner_join(sent)

#afinn
sent <- tokens %>% 
  inner_join(get_sentiments("afinn")) %>%
  group_by(doc_id) %>% 
  summarise(sentiment_afinn = sum(value))   %>% 
  inner_join(sent)

#sentimentr package

#default dictionary

text_with_punctuation <- readRDS("text_with_punctuation.rds")
text_with_punctuation  <- sentimentr::get_sentences(text_with_punctuation)
sentimentr <- sentimentr::sentiment_by(text_with_punctuation)
sent$sentimentr  <- sentimentr$ave_sentiment

#LM dictionary
lm <- tidytext::get_sentiments("loughran")
lm_key <- data.frame(
    words = lm$word,
    polarity = ifelse(lm$sentiment == "positive", 1, -1),
    stringsAsFactors = T
)

lm_key <- sentimentr::as_key(lm_key)
sentimentr_lm <- sentimentr::sentiment_by(text_with_punctuation, polarity_dt = lm_key)
sent$sentimentr_lm <- sentimentr_lm$ave_sentiment

# LM, GI, HE and QDAP sentiment dictionaries from the SentimentAnalysis package
sentiment <- SentimentAnalysis::analyzeSentiment(master$documents_pos_tagged, 
                                                 stemming=FALSE, removeStopwords=FALSE)
sentiment$WordCount <- NULL
sent <- cbind(sent,sentiment)

#SentimentAnalysis package also provides the ability to create custom dictionary 
#this is done by aligning words to response variable

cust_dict <- master %>% 
  inner_join(indicators) %>% 
  drop_na(cum_abnormal_return_next_day) %$% 
  generateDictionary(documents_pos_tagged,cum_abnormal_return_next_day, 
                     modelType="ridge", family="binomial")

cust_dict$intercept <- NULL
cust_dict <- as.data.frame(matrix(unlist(cust_dict), nrow=length(unlist(cust_dict[1]))))
names(cust_dict) <- c("word", "sentiment", "idf")
cust_dict$sentiment <- as.numeric(cust_dict$sentiment)

sent <- tokens %>% 
  inner_join(cust_dict) %>%
  group_by(doc_id)  %>% 
  summarise(sentiment_custom = sum(sentiment)) %>% 
  inner_join(sent, by="doc_id")

# calculate and add sentiment change columns to the sent data frame
ids <- master %>% select(cik, date, doc_id)

sent_diff <- function(sentiment) {
  sentiment_change <- sentiment - lag(sentiment)
}

sent <- sent %>% 
  inner_join(ids) %>% 
  mutate(cik = factor(cik)) %>% 
  group_by(cik) %>% 
  arrange(date)  %>% 
  mutate(across(where(is.numeric), list(change = ~ sent_diff(.))))

saveRDS(sent, "sent.rds")

rm(cust_dict, sentiment, lm, ids, lm_key, sent_diff, 
   text_with_punctuation, sentimentr_lm, tokens)
meta <- master %>% select(doc_id, GICS_Sub_Industry)
data_for_regression <- sent %>% 
  inner_join(indicators, by = c("doc_id", "date")) %>% 
  inner_join(meta, by = "doc_id") %>% 
  mutate(year = year(date)) %>% 
  left_join(fundamental_indicators, by=c("Symbol", "year")) %>% 
  mutate(cik = factor(cik)) %>% 
  mutate(GICS_Sub_Industry = factor(GICS_Sub_Industry))

Analysis

In this dataset multiple entities are observed over time; in other words, the data is panel data. Simple cross-sectional analysis does not account for unobserved heterogeneity among companies: firms may have underlying fixed factors that are not captured by our model, anything from brand value to employee satisfaction. To build a robust model this must be controlled for, which can be done through the plm package or manually. Specifying the model manually with lm enables further analysis with cross-validation, which is not as straightforward with plm. Cross-validation allows us to check whether the model is overfitting and to gauge its general predictive power.
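As a minimal sketch of this equivalence (illustrative only; plm is an additional dependency not in the library list above, and the variable names follow data_for_regression defined earlier), the dummy-variable approach and the within estimator give the same slope estimate:

# firm fixed effects two ways; the slope on the sentiment variable agrees
library(plm) # assumption: not part of the libraries loaded above

# 1) least-squares dummy variables, the approach used in this report
m_lsdv <- lm(cum_abnormal_return_next_day ~ SentimentLM + factor(cik),
             data = data_for_regression)

# 2) within (fixed effects) estimator from plm
m_within <- plm(cum_abnormal_return_next_day ~ SentimentLM,
                data = data_for_regression, index = "cik", model = "within")

coef(m_lsdv)["SentimentLM"]
coef(m_within)["SentimentLM"]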

# function to store cross validation results in tidy format
store_cv <- function(cv, y, x, to) {
  cv$y <- y
  cv$x <- x
  temp <- data.frame(model = unlist(cv))
  temp$measure <- rownames(temp)
  temp <- pivot_wider(temp,names_from = measure, values_from = model)
  return(rbind(to, temp))
}

The function regression2DwithCV takes a list of dependent variables, independent variables, and control variables. A regression is fitted for each pair of dependent and independent variables whilst controlling for the specified variables. 10-fold cross-validation is performed and the results are stored using the function defined above. Stargazer is used to display the regression and cross-validation results.

cv_resutls <- data.frame()

regression2DwithCV <- function(dv_names, iv_names, controls, name) {
  
  sentiment_models <- list()
  
  for (y in dv_names){
    for (x in iv_names) {
      x <- paste(x, controls, sep = "+")
      form <- formula(paste(y, "~", x))
      model <- lm(form, data=data_for_regression,  x = TRUE, y = TRUE) 
      sentiment_models[[y]][[x]] <- model
      cv_model <- cv.lm(model, k = 10, seed = 123, max_cores = detectCores() - 1)
      cv_resutls <<- store_cv(cv_model, y, x, cv_resutls)
    }
  }
  
  for (y in sentiment_models) {
    #link <- combURL("PartB", paste0(all.vars(formula(y[[1]]))[1], "_", name, ".html"))
    #print(name)
    #stargazer::stargazer(y, type = "html", omit="cik", out = link)
    #print(tab_model(y, collapse.ci = TRUE, collapse.se = TRUE))
  }
}
control <- c("GICS_Sub_Industry")
control <- paste(control, collapse = '+')
dv_names <- c("cum_abnormal_return_next_week", "avg_abnormal_return_next_week", 
                 "avg_abnormal_return_next_day", "cum_abnormal_return_next_day",
                 "avg_abnormal_volume_next_day","cum_abnormal_volume_next_day", 
              "volume_direction_next_day", "variance_next_week")

iv_names <- c("roe", "roa", "debt_equity_ratio", 
              "eps_earnings_per_share_diluted", 
              "pe_ratio", "roc", "ma7", "rsi", "obv")

control <- c("GICS_Sub_Industry")
control <- paste(control, collapse = '+')

regression2DwithCV(dv_names,iv_names, control, name="base_model-1")

iv_names <- c("roe + ma7", "roa + ma7", "debt_equity_ratio + ma7", 
              "eps_earnings_per_share_diluted + ma7", 
              "pe_ratio + ma7", "roc + ma7", "rsi + ma7", "obv + ma7")

regression2DwithCV(dv_names,iv_names, control, name="base_model-2")


control <- c("cik", "GICS_Sub_Industry", "debt_equity_ratio", "ma7")
control <- paste(control, collapse = '+')

iv_names <- c("sentiment_nrc", 
             "sentiment_bing", 
             "sentiment_afinn", 
             "SentimentGI", 
             "SentimentHE", 
             "SentimentLM", 
             "SentimentQDAP", 
             "RatioUncertaintyLM", 
             "sentimentr_lm", 
             "sentimentr", 
             "sentiment_custom")

regression2DwithCV(dv_names,iv_names, control, name="just_sentiment")

iv_names <- c("sentiment_nrc_change", 
             "sentiment_bing_change", 
             "sentiment_afinn_change", 
             "SentimentGI_change", 
             "SentimentHE_change", 
             "SentimentLM_change", 
             "SentimentQDAP_change", 
             "RatioUncertaintyLM_change", 
             "sentimentr_lm_change", 
             "sentimentr_change", 
             "sentiment_custom_change")

regression2DwithCV(dv_names,iv_names, control, name="sentiment_change")


iv_names <- c("NegativityGI", 
             "PositivityGI", 
             "NegativityHE", 
             "PositivityHE", 
             "PositivityLM", 
             "NegativityLM", 
             "NegativityQDAP", 
             "PositivityQDAP")

regression2DwithCV(dv_names,iv_names, control, name="neg_pos")

iv_names <- c("NegativityGI_change", 
             "PositivityGI_change", 
             "NegativityHE_change", 
             "PositivityHE_change", 
             "PositivityLM_change", 
             "NegativityLM_change", 
             "NegativityQDAP_change", 
             "PositivityQDAP_change")

regression2DwithCV(dv_names,iv_names, control, name="neg_pos_change")

iv_names <- c("anger", 
             "anticipation", 
             "disgust", 
             "fear", 
             "joy", 
             "sadness", 
             "surprise")

regression2DwithCV(dv_names,iv_names, control, name="emotion")
regression2DwithCV(dv_names, paste(iv_names, collapse = '+' ), control)


iv_names <- c("anger_change", 
             "anticipation_change", 
             "disgust_change", 
             "fear_change", 
             "joy_change", 
             "sadness_change", 
             "surprise_change")

regression2DwithCV(dv_names,iv_names, control, name="emotion_change")
regression2DwithCV(dv_names, paste(iv_names, collapse = '+' ), control)

iv_names <- c("SentimentLM", 
             "NegativityLM", 
             "disgust", 
             "disgust_change", 
             "sadness_change", 
             "sentiment_nrc")
regression2DwithCV(dv_names, paste(iv_names, collapse = '+' ), control, name="combined")

saveRDS(cv_resutls, "PartB/cv_results.rds")
cv_results <- readRDS("PartB/cv_results.rds")

cvs <- cv_results %>% 
    group_by(y) %>% 
    arrange(MAE.mean) %>% 
    top_n(5) %>% 
    select(MAE.mean, MAE.sd, y, x) %>% 
    group_split() 
## Selecting by x
#%>% 
    #knitr::kable() %>% 
    #kable_styling(position = "center")

cvs <- lapply(cvs, as.data.frame)

#stargazer::stargazer(cvs, summary = rep(F,length(cvs)), type = "text",no.space=TRUE)

for (table in cvs) {
  print(knitr::kable(table))
}
MAE.mean MAE.sd y x
0.0091782386533657 0.00150966428059187 avg_abnormal_return_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00918275577718206 0.0014864845553651 avg_abnormal_return_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00937887457994281 0.00157252778146152 avg_abnormal_return_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00942693109529079 0.00163577620450724 avg_abnormal_return_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00944016648277091 0.00157738589935523 avg_abnormal_return_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.00488252030219267 0.000792097772451372 avg_abnormal_return_next_week sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00488480213504792 0.000827350956256574 avg_abnormal_return_next_week sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0048971316189137 0.000803705505237141 avg_abnormal_return_next_week surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00506523396665765 0.00101059136594638 avg_abnormal_return_next_week sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.00511027038066066 0.000973603470904424 avg_abnormal_return_next_week surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.285823179466672 0.0332135958018523 avg_abnormal_volume_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.286055675928028 0.0305144248874301 avg_abnormal_volume_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.286720532002201 0.0303513796748044 avg_abnormal_volume_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.28773818023958 0.02961373174355 avg_abnormal_volume_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.289037804342597 0.034031508505347 avg_abnormal_volume_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.0183564773067314 0.00301932856118374 cum_abnormal_return_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0183655115543641 0.00297296911073021 cum_abnormal_return_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0187577491598856 0.00314505556292304 cum_abnormal_return_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0188538621905816 0.00327155240901447 cum_abnormal_return_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0188803329655418 0.00315477179871046 cum_abnormal_return_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.0246264358413286 0.00396814378728267 cum_abnormal_return_next_week sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0246437007683787 0.00415640509804959 cum_abnormal_return_next_week sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.024726079602924 0.0040109198300447 cum_abnormal_return_next_week surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0255328776878948 0.0051408946575127 cum_abnormal_return_next_week sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.0257899431787833 0.00492203229343267 cum_abnormal_return_next_week surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.571646358933344 0.0664271916037046 cum_abnormal_volume_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.572111351856057 0.0610288497748603 cum_abnormal_volume_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.573441064004402 0.0607027593496088 cum_abnormal_volume_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.575476360479161 0.0592274634870999 cum_abnormal_volume_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.578075608685193 0.0680630170106941 cum_abnormal_volume_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.000230415809367263 6.85018531782221e-05 variance_next_week sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.000230964051486379 6.90107158300945e-05 variance_next_week sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.000231354207990225 6.22479921024627e-05 variance_next_week surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.000232941853469224 4.48720128309396e-05 variance_next_week surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.000236066889363565 3.90054950919969e-05 variance_next_week sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
MAE.mean MAE.sd y x
0.409198809719714 0.0395613828604193 volume_direction_next_day sentimentr_lm_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.409492450027264 0.0423304681838998 volume_direction_next_day surprise_change+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.420026866804643 0.0460438835979588 volume_direction_next_day surprise+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.420072765324612 0.0464012455806698 volume_direction_next_day sentimentr_lm+cik+GICS_Sub_Industry+debt_equity_ratio+ma7
0.420807588254073 0.0455501247626208 volume_direction_next_day sentimentr+cik+GICS_Sub_Industry+debt_equity_ratio+ma7

Surprise_change is best at predicting cum_abnormal_return_next_day and avg_abnormal_return_next_day. Sentimentr_lm and sentimentr_lm_change lead in terms of predictive power for all other financial indicators. Note, however, that none of these variables are statistically significant according to the regression results. For detailed results of the regression analysis see the Appendix.

control <- c("cik", "GICS_Sub_Industry", "debt_equity_ratio", "ma7")
control <- paste(control, collapse = '+')
sentiment_models <- list()
dv_names_binary <- c("movment_direction")

iv_names <- c("sentiment_nrc", 
             "sentiment_bing", 
             "sentiment_afinn", 
             "SentimentGI", 
             "SentimentHE", 
             "SentimentLM", 
             "SentimentQDAP", 
             "RatioUncertaintyLM", 
             "sentimentr", 
             "NegativityGI", 
             "PositivityGI", 
             "NegativityHE", 
             "PositivityHE", 
             "PositivityLM", 
             "NegativityLM", 
             "NegativityQDAP", 
             "PositivityQDAP",
             "sentiment_nrc_change", 
             "sentiment_bing_change", 
             "sentiment_afinn_change", 
             "SentimentGI_change", 
             "SentimentHE_change", 
             "SentimentLM_change", 
             "SentimentQDAP_change", 
             "RatioUncertaintyLM_change", 
             "sentimentr_change", 
             "NegativityGI_change", 
             "PositivityGI_change", 
             "NegativityHE_change", 
             "PositivityHE_change", 
             "PositivityLM_change", 
             "NegativityLM_change", 
             "NegativityQDAP_change", 
             "PositivityQDAP_change")

for (y in dv_names_binary){
  for (x in iv_names) {
    x <- paste(x, control, sep = "+")
    form <- formula(paste(y, "~", x))
    sentiment_models[[y]][[x]] <- glm(form, data=data_for_regression, family = "binomial") 
  }
}

name <- "test"

for (y in sentiment_models) {
  name <- combURL("PartB", paste0(all.vars(formula(y[[1]]))[1], "-", name, ".tex"))
  stargazer::stargazer(y, type = "latex", out = name)
}

Part C: Topic Modeling

A topic is a bag of words where each word is assigned a probability of belonging to the topic. A document consists of multiple topics of varying proportions.
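These two layers can be inspected directly once the stm model has been fitted (the object called model later in this part); a minimal sketch using tidytext's tidiers for stm objects (the gamma form is also used further below):

# per-topic word probabilities, i.e. the "bag of words" for each topic
tidy(model, matrix = "beta")
# per-document topic proportions
tidy(model, matrix = "gamma")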

In this part of the study we use Structural Topic Modeling (STM) to discover topics in the textual data. The main advantage of STM over LDA or CTM is that STM allows users to include document metadata in the model (Roberts, 2016). Topic prevalence and topic content in a document can thus be associated with metadata.

The topics discussed by management depend on the industry the company is in, as was clearly illustrated by the tf-idf token exploration in Part A. The time at which a report was written is also likely to affect the topics mentioned; for instance, in 2020 reports we expect topics related to hygiene to be more prevalent. Therefore, sub-industry and temporal variables are included as STM model covariates. Primarily, we want to examine the relationship between the return on a stock and the topics discussed, so cumulative abnormal returns are added as a prevalence covariate. Additionally, company features such as the price-to-earnings ratio and the debt-to-equity ratio are included, since they were found to be good predictors of abnormal returns; for instance, it is possible that the management of a company with a higher debt-to-equity ratio will focus on debt in the management discussion section of the report. Technical indicators are not included since they only reflect the short-term nature of a company's price movements rather than some intrinsic company state, so there is no plausible mechanism by which they could affect the topics discussed.

According to Stewart (2020), prevalence covariates are not particularly sensitive to the number of metadata variables used, whereas content covariates are. Therefore, whilst adding a number of prevalence covariates appears reasonable, adding numerous content covariates does not. Moreover, adding content covariates removes the ability to carry out Search K. Since selecting K is perhaps the most important decision in this type of analysis, no content covariates are added.

Spectral initialization is recommended by Roberts et al. (2016): it outperforms LDA and random initialization, and it returns consistent topics by focusing on anchor words (Mourtgos & Adams, 2019). According to the stm documentation, a rough guess of the optimal number of topics is up to 50 for a corpus of a few hundred documents. Therefore, Search K in this study is carried out for K between 2 and 60.

The stm output gives, for each topic, words ranked by probability, FREX, lift and score. To label topics the main focus is on the FREX measure, which weighs how frequently a word appears in a topic against how exclusive it is to that topic. A conceptual parallel to tf-idf can be drawn here.
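A minimal sketch of pulling these rankings from a fitted model (referring to the model object created later in this part); frexweight balances frequency against exclusivity and defaults to 0.5:

# top words per topic under the Highest Prob, FREX, Lift and Score rankings
labelTopics(model, n = 7, frexweight = 0.5)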

Data Preparation

meta <- master %>% 
  select(cik, GICS_Sub_Industry, documents_pos_tagged, date) %>% 
  mutate(year = year(date), cik = factor(cik), GICS_Sub_Industry = factor(GICS_Sub_Industry))  

data_for_stm <- data_for_regression %>% 
  select(cum_abnormal_return_next_week, cum_abnormal_return_next_day, debt_equity_ratio, pe_ratio, cik,  year) %>% 
  inner_join(meta, by=c("cik", "year")) %>% select(-cik, -date) 

rm(data_for_regression)

processed <- textProcessor(data_for_stm$documents_pos_tagged,
                           metadata = data_for_stm,
                           stem = F)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Creating Output...
threshold <- round(1/100 * length(processed$documents),0)

out <- prepDocuments(processed$documents,
                     processed$vocab,
                     processed$meta,
                     lower.thresh = threshold)
## Removing 64 of 3169 terms (166 of 203687 tokens) due to frequency 
## Your corpus now has 326 documents, 3105 terms and 203521 tokens.

Search K

k_values = seq(from=2,to=60,by=2)
search_k_results <- searchK(out$documents, out$vocab, K=k_values,
                            N = floor(0.1*length(out$documents)),
                            prevalence = ~cum_abnormal_return_next_day + 
                              factor(cik) + 
                              s(year) + 
                              GICS_Sub_Industry + 
                              pe_ratio + 
                              debt_equity_ratio,
                            cores = 2,
                            data=out$meta)
k_values = seq(from=2,to=12,by=1)
search_k_results_deep <- searchK(out$documents, out$vocab, K=k_values,
                            N = floor(0.1*length(out$documents)),
                            prevalence = ~cum_abnormal_return_next_day + 
                              s(year) + 
                              factor(cik) + 
                              GICS_Sub_Industry + 
                              pe_ratio + 
                              debt_equity_ratio,
                            cores = 2,
                            data=out$meta)
k_values = seq(from=8,to=20,by=1)
search_k_results_deep <- searchK(out$documents, out$vocab, K=k_values,
                            N = floor(0.1*length(out$documents)),
                            prevalence = ~cum_abnormal_return_next_day + 
                              s(year) + 
                              factor(cik) + 
                              GICS_Sub_Industry + 
                              pe_ratio + 
                              debt_equity_ratio,
                            cores = 2,
                            data=out$meta)
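The diagnostic curves shown below can be produced with stm's plot method for searchK objects; a minimal sketch using the objects created above:

plot(search_k_results)      # held-out likelihood, residuals, semantic coherence, lower bound
plot(search_k_results_deep)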
Search K diagnostic plots
Unsurprisingly, semantic coherence is very high when few topics are present; this is due to the statistical properties of the technique (Mimno, 2011; Roberts, 2014). Thus, peaks beyond the initial high values should be looked at. The levelling off of the held-out likelihood and lower bound curves, as well as the trough in the residuals and the peak in semantic coherence, all suggest an optimal value of K equal to 9.

optimal_k <- 9

Model Tuning

optimal_k_models <- selectModel(documents = out$documents, 
                                vocab = out$vocab,
                                K =  optimal_k,
                                prevalence = ~cum_abnormal_return_next_day + 
                                  s(year) + 
                                  GICS_Sub_Industry + 
                                  pe_ratio + 
                                  debt_equity_ratio,
                                max.em.its = 150,
                                gamma.prior='L1',
                                data = out$meta,
                                init.type = "Spectral", 
                                ngroups = 5)
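The comparison below can be drawn with stm's plotModels(), which plots each run's exclusivity against its semantic coherence; a minimal sketch using the selectModel output above:

plotModels(optimal_k_models)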
Comparing Models

For most topics the models have almost the same values for semantic coherence and exclusivity. However, for one of the topics, models 2 and 3 outperform all others in terms of semantic coherence by a substantial margin. Model 2 is selected for further analysis as it performs better on exclusivity in some of the other cases.

model <- optimal_k_models$runout[[2]]
save(model, file="PartC/stm_optimal_model.rda")

Results

tidy_summary <- data.frame(FREX = do.call(paste, 
                                          c(as.data.frame(summary(model)$frex), sep=", ")),
                           Lift = do.call(paste, 
                                          c(as.data.frame(summary(model)$lift), sep=", ")),
                           Score = do.call(paste, 
                                           c(as.data.frame(summary(model)$score), sep=", ")),
                           Prob = do.call(paste, 
                                          c(as.data.frame(summary(model)$prob), sep=", "))) %>% 
  mutate(topic = row_number()) %>% 
  select(topic, everything()) 
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
## A topic model with 9 topics, 326 documents and a 3105 word dictionary.
labels <- c("Lawsuits", "Operationw Abroad" ,  "Software & Hardware", 
            "Accounting Standards", "Distribution and Manufacturing",
            "Insurance", "Digital Royalty", "Card Rewards", "Subscription")
tidy_summary[,2:5] %>% 
  knitr::kable(booktabs = TRUE) %>% 
  pack_rows(index = paste("Topic", c(1:9), ":", labels)) %>% 
  kable_styling(latex_options = "scale_down")
FREX Lift Score Prob
Topic 1 : Lawsuits
class, litigation, court, escrow, complaint, interchange, plaintiff accountants, covered, disagreements, circumvention, overriding, benefits, signatory accountants, class, plaintiff, complaint, escrow, covered, defendant company, asset, share, income, class, common, settlement
Topic 2 : Operations Abroad
consumer, transfer, agent, region, negatively, versus, euro reside, arbitrator, constructive, expectancy, generator, illicit, impediment lines, consumer, border, check, segment, processing, agent rate, revenue, income, currency, foreign, expense, business
Topic 3 : Software & Hardware
hardware, percent, support, software, update, system, premise decreasing, deflationary, elements, optimistic, picture, post-combination, quantified hardware, deflationary, support, subscription, software, middleware, license revenue, service, software, expense, product, customer, support
Topic 4 : Accounting Standards
non-gaap, investor, mutual, company, proxy, communication, measure academia, accountable, forecasts, imprecise, reaction, sub-section, biomedical company, chemistry, non-gaap, proxy, earnings, perpetual, investor company, revenue, income, expense, related, result, rate
Topic 5 : Distribution and Manufacturing
client, payroll, insurance, fund, processing, online, solution intermediate, quicken, calculate, centralize, embezzlement, exhaustive, garnishment client, centralize, payroll, worker, segment, processing, service revenue, service, income, client, rate, total, business
Topic 6 : Insurance
wafer, distributor, inventory, fabrication, memory, manufacturing, shipment advantages, averse, bifurcation, constantly, controllers, dense, enthusiast wafer, distributor, fabrication, inventory, memory, gigabit, semiconductor product, income, expense, rate, primarily, market, result
Topic 7 : Digital Royalty
digital, royalty, device, creative, media, wireless, circuit authoring, acrobat, advertiser, cost-sensitive, foregone, hobbyist, localization risky, subscription, wireless, creative, circuit, modem, acrobat revenue, related, primarily, income, product, expense, increase
Topic 8 : Card Rewards
fuel, mile, label, reward, private, spread, redemption accessory, accordion, apparel, bankrupt, branded, coalition, collector fuel, mile, label, conduit, reward, fleet, grocery credit, revenue, rate, increase, income, expense, transaction
Topic 9 : Subscription
maintenance, subscription, observable, billing, professional, input, privately ample, convention, correct, diligent, drawdown, forint, freight subscription, maintenance, forint, hardware, perpetual, upfront, seat revenue, product, cost, increase, service, expense, asset
tidy_gamma <- tidy(model, matrix = "gamma", document_names = rownames(out$meta))

tidy_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  mutate(topic = reorder(topic, gamma)) %>% 
  ggplot(aes(topic, gamma, label = labels, fill = topic)) +
  geom_col(show.legend = FALSE, alpha = 0.8) +
  geom_text(hjust = 1.2, nudge_y = 0.0005, size = 10, color='white') +
  coord_flip() +
  theme_light(base_size = 22)  +
  labs(x = NULL, y = expression(gamma),
       title = "Top topics by prevalence in 10-K reports")

ggsave("PartC/topic_proportions_in_corpus.png", 
       width = 50, height = 35, units = "cm")
Topic Proportions across Industries

tidy_gamma  %>% 
  pivot_wider(id_cols=document, names_from = topic, values_from = gamma) %>% 
  cbind(meta) %>% 
  select(-documents_pos_tagged, -year, -cik) %>% 
  group_by(GICS_Sub_Industry) %>% 
  summarise(across(where(is.numeric), mean)) %>% 
  pivot_longer(!GICS_Sub_Industry, names_to = "topic", values_to = "gamma") %>% 
  mutate(topic = factor(topic)) %>% 
  ggplot() + 
  geom_bar(aes(x = topic, y = gamma, fill =topic), alpha = 0.8, stat = "identity") + 
  facet_wrap(.~GICS_Sub_Industry) + 
  theme_light(base_size = 22) + 
  theme(strip.background=element_blank(), 
        strip.text=element_text(colour = 'black', face = "bold", size = 17)) + 
  xlab("Topic") + ylab("Mean Gamma") + 
  scale_fill_discrete(name="Legend",labels=labels)

ggsave("PartC/topic_proportions_across_industries.png", 
       width = 50, height = 35, units = "cm")

In this graph the variation in topic proportions across industries can be seen.


Document distributions across topics

#exclude all docs with prob less than 1 percent
#allows for better examination
ggplot(tidy_gamma %>% filter(gamma > 0.01), 
       aes(gamma, fill = as.factor(topic))) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, ncol = 3) +
  labs(title = "Document probabilities distribution per topic",
       y = "Number of reports", x = expression(gamma)) +
  theme_light(base_size = 22)

ggsave("PartC/document_probabilities_distribution.png",
       width = 45, height = 45, units = "cm")

Each topic is strongly associated with some of the documents and less so with others.

effects_return <- estimateEffect(1:optimal_k ~debt_equity_ratio, stmobj = model, meta = out$meta)

margin1 <- as.numeric(quantile(out$meta$debt_equity_ratio)[2])
margin2 <- as.numeric(quantile(out$meta$debt_equity_ratio)[4])

plot(effects_return, covariate = "debt_equity_ratio",
     topics = 1:optimal_k,
     model = model, method = "difference",
     cov.value1 = margin2, cov.value2 = margin1,
     xlab = "Low Debt ... High Debt",
     xlim = c(-0.01,0.01),
     main = "Marginal change on topic probabilities for low and high price",
     custom.labels = labels,
     ci.level = 0.05,
     labeltype = "custom")

effects_return <- estimateEffect(1:optimal_k ~pe_ratio, stmobj = model, meta = out$meta)

margin1 <- as.numeric(quantile(out$meta$pe_ratio)[2])
margin2 <- as.numeric(quantile(out$meta$pe_ratio)[4])

plot(effects_return, covariate = "pe_ratio",
     topics = 1:optimal_k,
     model = model, method = "difference",
     cov.value1 = margin2, cov.value2 = margin1,
     xlab = "Low PE ... High PE",
     xlim = c(-0.01,0.01),
     main = "Marginal change on topic probabilities for low and high PE ratio",
     custom.labels = labels,
     ci.level = 0.05,
     labeltype = "custom")

effects_return <- estimateEffect(1:optimal_k ~cum_abnormal_return_next_day, stmobj = model, meta = out$meta)

margin1 <- as.numeric(quantile(out$meta$cum_abnormal_return_next_day)[2])
margin2 <- as.numeric(quantile(out$meta$cum_abnormal_return_next_day)[4])

plot(effects_return, covariate = "cum_abnormal_return_next_day",
     topics = 1:optimal_k,
     model = model, method = "difference",
     cov.value1 = margin2, cov.value2 = margin1,
     xlab = "Low Price ... High Price",
     xlim = c(-0.05,0.05),
     main = "Marginal change on topic probabilities for low and high price",
     custom.labels = labels,
     ci.level = 0.05,
     labeltype = "custom")

effects_year <- estimateEffect(1:optimal_k ~s(year), stmobj = model, meta = out$meta)

plot(effects_year, covariate = "year",
     topics = 1:optimal_k,
     model = model, method = "continuous",
     xlab = "Past ... Present",
     main = "Marginal change on topic probabilities across years",
     custom.labels =labels,
     ci.level = 0.05,
     labeltype = "custom")

topic_correlations <- topicCorr(model) 
plot.topicCorr(topic_correlations,
               vlabels = labels,
               vertex.color = "#CDF0EA", 
               vertex.label.cex = 1, 
               vertex.size=30, 
               vertex.label.color="#053742")

These charts depict the marginal change in topic prevalence as each covariate changes.
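Numeric counterparts to these plots are available through the summary method for estimateEffect objects; a minimal sketch reusing the objects created above:

summary(effects_year, topics = 1)    # regression table for topic 1 on the year spline
summary(effects_return, topics = 1)  # regression table for topic 1 on abnormal returns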

Regression

tidy_theta <- as.data.frame(model$theta)
colnames(tidy_theta) <- paste0("topic_",1:9)
tidy_theta <- cbind(out$meta,tidy_theta)
topics <- paste0("topic_", 1:9)
iv_names <- paste(c(topics , "cik"), collapse = " + ")
lm_model <- lm(paste("cum_abnormal_return_next_day ~", iv_names), tidy_theta)
tidy(lm_model) %>% kable() %>%  
  kable_styling(position = "center")
term estimate std.error statistic p.value
(Intercept) -0.0045039 0.0093844 -0.4799320 0.6316400
topic_1 -0.0146199 0.0095478 -1.5312351 0.1268090
topic_2 0.0064744 0.0068529 0.9447689 0.3455688
topic_3 0.0070618 0.0082127 0.8598637 0.3905794
topic_4 0.0112073 0.0087023 1.2878567 0.1988297
topic_5 -0.0081481 0.0079406 -1.0261292 0.3056918
topic_6 -0.0000164 0.0058805 -0.0027942 0.9977724
topic_7 0.0024948 0.0076385 0.3266045 0.7442043
topic_8 0.0037274 0.0079790 0.4671466 0.6407482
topic_9 NA NA NA NA
cik4127 0.0025809 0.0116364 0.2217923 0.8246328
cik6281 0.0066921 0.0116626 0.5738114 0.5665433
cik723125 0.0079608 0.0115370 0.6900222 0.4907359
cik723531 -0.0002769 0.0123163 -0.0224848 0.9820768
cik743316 -0.0024732 0.0117525 -0.2104429 0.8334708
cik743988 0.0052937 0.0115885 0.4568068 0.6481543
cik769397 0.0162905 0.0116651 1.3965126 0.1636355
cik779152 0.0046571 0.0116263 0.4005662 0.6890365
cik796343 -0.0005136 0.0116500 -0.0440898 0.9648633
cik798354 0.0077926 0.0116638 0.6681061 0.5046009
cik804328 0.0012401 0.0116892 0.1060883 0.9155861
cik813672 0.0042170 0.0115978 0.3636017 0.7164222
cik827054 0.0068285 0.0115704 0.5901682 0.5555406
cik849399 -0.0005927 0.0115451 -0.0513378 0.9590919
cik877890 0.0086757 0.0116911 0.7420753 0.4586465
cik883241 0.0059439 0.0115503 0.5146102 0.6072201
cik896878 0.0052314 0.0116270 0.4499312 0.6530985
cik1013462 -0.0076748 0.0116263 -0.6601238 0.5097020
cik1045810 -0.0085638 0.0116545 -0.7348037 0.4630569
cik1101215 -0.0043800 0.0116667 -0.3754310 0.7076163
cik1108524 0.0070212 0.0116741 0.6014353 0.5480232
cik1123360 -0.0122103 0.0118764 -1.0281202 0.3047560
cik1136893 0.0076943 0.0115983 0.6634022 0.5076036
cik1141391 0.0034293 0.0116159 0.2952265 0.7680335
cik1175454 0.0043529 0.0119242 0.3650481 0.7153434
cik1341439 0.0146612 0.0115752 1.2666097 0.2063183
cik1365135 0.0033617 0.0117230 0.2867624 0.7745004
cik1383312 0.0322571 0.0116135 2.7775440 0.0058367
cik1403161 -0.0001431 0.0115294 -0.0124122 0.9901054
glance(lm_model) %>% kable() %>% 
  kable_styling(position = "center")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.1124705 -0.0015524 0.0269775 0.9863856 0.4966375 37 735.3848 -1392.77 -1245.081 0.2096024 288 326

None of the topics are useful in predicting abnormal returns. Note that topic_9 is reported as NA because the topic proportions within each document sum to one, so the final topic is perfectly collinear with the intercept and the other topic columns and is dropped by lm.
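A minimal sketch of the same regression with the redundant column dropped explicitly (lm_model_drop9 and iv_names_drop9 are hypothetical names not used elsewhere in this report):

# drop topic_9 so that each coefficient measures shifting proportion away from topic_9
iv_names_drop9 <- paste(c(paste0("topic_", 1:8), "cik"), collapse = " + ")
lm_model_drop9 <- lm(paste("cum_abnormal_return_next_day ~", iv_names_drop9), tidy_theta)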

Unsupervised Model

auto_stm_model <- stm(documents = out$documents, 
                      vocab = out$vocab,
                      K = 0,
                      prevalence =~cum_abnormal_return_next_day + s(year) + 
                        factor(cik) + 
                        GICS_Sub_Industry + 
                        pe_ratio + 
                        debt_equity_ratio, 
                      max.em.its = 150,
                      gamma.prior='L1',
                      data = out$meta,
                      init.type = "Spectral", 
                      ngroups = 5)
save(auto_stm_model, file="PartC/auto_stm_model.rda")
load("PartC/auto_stm_model.rda")
tidy_summary <- data.frame(FREX = do.call(paste, c(as.data.frame(summary(auto_stm_model)$frex), sep=", ")),
                           Lift = do.call(paste, c(as.data.frame(summary(auto_stm_model)$lift), sep=", ")),
                           Score = do.call(paste, c(as.data.frame(summary(auto_stm_model)$score), sep=", ")),
                           Prob = do.call(paste, c(as.data.frame(summary(auto_stm_model)$prob), sep=", "))) %>% 
  mutate(topic = row_number()) %>% 
  select(topic, everything()) 
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
## A topic model with 55 topics, 326 documents and a 3105 word dictionary.
tidy_summary[,2:5] %>% 
  knitr::kable(booktabs = TRUE) %>% 
  pack_rows(index = paste("Topic", c(1:55))) %>% 
  kable_styling(latex_options = "scale_down")
FREX Lift Score Prob
Topic 1
executive, disclosure, control, decision, resource, addition, accounting accountants, control, disclosure, executive, decision, officer, resource accountants, control, executive, decision, disclosure, officer, resource control, disclosure, executive, decision, addition, accounting, management
Topic 2
class, court, plaintiff, complaint, defendant, benefits, motion benefits, preliminarily, investigative, declaratory, settled, co-defendant, unlawful benefits, class, plaintiff, complaint, defendant, escrow, court company, asset, class, share, common, income, note
Topic 3
client, payroll, insurance, fund, administration, worker, associate centralize, topics, ancillary, attendance, paid, intermediate, garnishment client, payroll, insurance, check, worker, centralize, remittance client, service, investment, rate, income, fund, revenue
Topic 4
limitation, procedure, circumvention, effectiveness, inherent, possibility, constraint circumvention, procedure, accountant, constraint, error, misstatement, possibility circumvention, procedure, effectiveness, absolute, error, constraint, control control, reporting, reasonable, procedure, internal, limitation, effectiveness
Topic 5
mutual, proxy, communication, earnings, advisor, consisting, investor closed, nontaxable, post-retirement, newsletter, registrar, contests, non-trade proxy, mutual, earnings, closed, advisor, non-gaap, investor revenue, increase, company, expense, income, rate, earning
Topic 6
accompanying, merger, check, line, unfavorable, trademark, network combat, pronouncements, lines, restrictive, commits, incoming, membership combat, check, accompanying, merger, vertical, trademark, membership income, credit, revenue, asset, rate, period, expense
Topic 7
update, hardware, support, software, comparison, education, license computationally, annualize, arranger, post-combination, middleware, shortly, codification hardware, middleware, update, computationally, software, education, support revenue, product, software, expense, hardware, support, rate
Topic 8
mutual, earnings, proxy, outsourcing, distribution, communication, broker correspondent, weights, non-compliance, securities, archival, piece, midrange correspondent, earnings, proxy, mutual, outsourcing, broker, client revenue, company, operation, service, income, agreement, increase
Topic 9
description, emulation, warrant, maintenance, hardware, restructuring, conversion cryptocurrency, convention, industrialized, product-specific, blueprint, logical, contemplation hardware, emulation, maintenance, cryptocurrency, description, warrant, conversion product, revenue, change, note, cost, related, asset
Topic 10
class, pension, nominal, claim, sustainable, convert, differential defendants, honor, panel, violate, complaints, pre-trial, omnibus defendants, class, interchange, escrow, pension, client, litigation company, income, asset, share, class, common, note
Topic 11
implementation, standard, standalone, stream, practical, delivery, outsourcing deflationary, randomly, multi-element, organizations, non-essential, invested, survey deflationary, survey, outsourcing, processing, hardware, remittance, maintenance revenue, service, cost, customer, related, company, product
Topic 12
microprocessor, graphics, amendment, indenture, chipset, processor, shipment averse, derivation, enthusiast, freedom, meaningfully, dense, semi-custom microprocessor, graphics, wafer, dense, shipment, semi-custom, chipset expense, related, primarily, amount, product, income, decrease
Topic 13
system, independent, report, management, event, effective, accounting disagreements, system, independent, public, management, report, control system, disagreements, independent, report, public, control, internal system, independent, management, report, future, accounting, effective
Topic 14
distributor, signal, assembly, microcontroller, debenture, semiconductor, capacity distributors, inappropriate, offshore, rare, interface, uncommon, signal distributor, microcontroller, wafer, debenture, semiconductor, assembly, memory product, distributor, approximately, income, cost, acquisition, amount
Topic 15
insurance, client, payroll, worker, worksite, fund, administration embezzlement, flex, worksite, usual, facing, renovation, overnight client, insurance, embezzlement, worksite, worker, payroll, non-gaap client, service, income, rate, investment, fund, share
Topic 16
return, allowances, pre-tax, research, title, derivative, contingent breakout, end-customer, exclusivity, indicators, non-warranty, differences, post-shipment exclusivity, inventory, distributor, wafer, non-warranty, research, shipment income, asset, revenue, expense, primarily, rate, product
Topic 17
maintenance, perpetual, license, upfront, professional, chip, criterion ample, hotline, drawdown, forint, fronts, mentioned, post-customer maintenance, perpetual, upfront, hardware, shipment, chip, functionally revenue, license, increase, customer, term, service, cost
Topic 18
fuel, wholesale, organic, fleet, spread, macroeconomic, network toll, undivided, acceptability, diagram, expansive, fleets, gallon fuel, wholesale, fleet, organic, transportation, spread, gallon revenue, income, transaction, rate, fuel, facility, impact
Topic 19
mile, reward, label, database, private, cardholder, redemption grocery, fashion, woman, furnishings, email, trusts, apparel mile, reward, label, cardholder, database, breakage, sponsor credit, increase, rate, service, revenue, expense, mile
Topic 20
seat, geography, non-gaap, subscription, reseller, maintenance, suite curricula, deploy, digitally, disciplinary, downloadable, educator, expositions subscription, non-gaap, seat, maintenance, horizontal, geography, reseller revenue, product, increase, expense, business, cost, primarily
Topic 21
consumer, agent, transfer, location, region, rating, paper intra-country, re-balancing, uncertainties, imposing, migrant, constructive, consumers intra-country, consumer, agent, border, rating, transfer, region revenue, rate, business, transaction, consumer, foreign, income
Topic 22
non-gaap, simulation, perpetual, maintenance, operational, lease, investor sub-section, academia, accountable, biomedical, chemical, chemistry, copyright non-gaap, maintenance, perpetual, company, simulation, investor, matrix company, revenue, income, expense, related, result, rate
Topic 23
label, private, mile, reward, redemption, sponsor, collector merchandise, harbor, catalog, label, year, collector, permission mile, label, reward, private, collector, breakage, merchandise credit, service, revenue, rate, mile, private, reward
Topic 24
circuit, wireless, device, royalty, spectrum, marketable, patent multimode, -process, circuits, codec, laptops, messaging, nonfunctional wireless, circuit, device, multimode, licensee, royalty, marketable revenue, related, rate, product, primarily, increase, asset
Topic 25
communication, proxy, wealth, mutual, investor, retirement, earnings multi-asset, borrowed, wealth, post-employment, mailing, sell, clearance multi-asset, wealth, proxy, earnings, mutual, non-gaap, investor company, revenue, service, management, income, activity, asset
Topic 26
percent, subscription, professional, invoice, renewal, billing, non-gaap co-location, contributor, motivate, multi-tenant, undeveloped, impending, parking subscription, non-gaap, percent, absolute, invoice, multi-tenant, billing revenue, service, expense, customer, total, percent, increase
Topic 27
border, rebate, euro, versus, issuer, local, incentive nonstandard, numerical, reconcile, non-european, encouraging, neutral, warning neutral, border, rebate, euro, versus, cardholder, non-gaap currency, expense, revenue, foreign, income, rate, customer
Topic 28
premise, hardware, license, index, infrastructure, support, swap non-oracle, summation, interoperable, interest, host, firmware, upward hardware, premise, non-oracle, index, swap, marketable, deployment revenue, license, service, expense, rate, currency, hardware
Topic 29
debit, guarantee, check, intrusion, ticket, channel, associations ticket, non-routine, resell, restaurants, intrusion, associations, dishonored check, ticket, debit, intrusion, non-routine, associations, interchange service, credit, facility, rate, income, revenue, loss
Topic 30
banking, check, consolidation, swap, incremental, variance, processing corrupt, disrupt, non-traditional, sophistication, steal, surviving, detrimental check, banking, client, earnings, processing, swap, non-traditional revenue, service, rate, operation, income, period, business
Topic 31
comprehensive, damage, court, objection, derivative, liabilities, incentive objection, unadjusted, argument, appellate, alleged, intra-entity, allegedly objection, interchange, complaint, damage, plaintiff, class, court asset, income, company, loss, share, liability, foreign
Topic 32
overriding, internal, reporting, procedure, effectiveness, control, misstatement overriding, inadequate, misstatement, detection, authorizations, procedure, fairly overriding, procedure, misstatement, reporting, control, effectiveness, internal control, internal, reporting, management, share, procedure, acquisition
Topic 33
storage, content, server, authoritative, joint, indemnification, undelivered controversy, authentication, liberal, outweigh, partnering, taxpaying, tolling subscription, storage, partnering, authoritative, server, rebate, content revenue, asset, amount, income, cost, expense, primarily
Topic 34
profit, reserve, material, military, equivalent, research, uncertainty amplifier, precision, revolution, smartphones, multitude, pascal, prohibitive pascal, military, inventory, cellular, research, medical, erosion expense, related, revenue, rate, income, result, asset
Topic 35
desktop, investments, online, segment, staffing, unsecured, payroll exhaustive, nimbly, patient, quicken, suspicious, employ, pre-established desktop, patient, quicken, staffing, segment, payroll, online revenue, income, service, business, expense, segment, total
Topic 36
creative, developer, redundant, stable, prepayment, termination, restructuring shippable, foregone, perpetually, hobbyist, redundant, download, localization creative, perpetually, acrobat, redundant, developer, subscription, element revenue, product, related, primarily, income, expense, cost
Topic 37
digital, media, subscription, document, creative, backlog, offering personalize, cost-sensitive, subscribe, syncing, advertiser, trajectory, photography subscription, media, digital, creative, document, perpetual, personalize revenue, increase, income, digital, primarily, foreign, subscription
Topic 38
implementation, outsourced, hardware, complementary, support, electronic, element picture, decreasing, optimistic, unwavering, non-exclusive, aircraft, bullet hardware, outsourced, outsourcing, element, picture, installation, complementary revenue, service, cost, customer, support, product, software
Topic 39
hardware, update, software, premise, support, comparison, subscription protect, -aservice, elements, instructor, perfunctory, rational, agility hardware, subscription, update, software, premise, support, storage software, revenue, product, hardware, support, service, expense
Topic 40
distributor, microcontroller, assembly, half, auction, capacity, debenture purely, serial, proposal, fail, pre-determined, non-proprietary, unrelated distributor, microcontroller, purely, wafer, debenture, memory, inventory product, distributor, market, income, result, investment, approximately
Topic 41
restructure, realizable, dram, volume, decline, equipment, manufacture re-use, restructure, forecasting, outpace, rolling, qualification, non-trade restructure, dram, outpace, realizable, memory, qualification, inventories product, primarily, cost, amount, income, increase, rate
Topic 42
unsecured, contingent, special, action, employment, demand, distributor instrumentation, roadmaps, tenor, sizing, sold, injury, predominant distributor, inventory, predominant, roadmaps, industrial, categorization, shipment result, income, rate, increase, product, revenue, amount
Topic 43
consumer, transfer, negatively, agent, region, paper, strengthening arbitrator, interconnected, multi-strategy, varied, expectancy, saving, illicit consumer, saving, agent, strengthening, peso, pension, region rate, revenue, currency, foreign, income, consumer, business
Topic 44
division, input, workspace, observable, authoritative, collaboration, virtualization duplication, login, mitigate, observation, password, shrink, turnaround workspace, division, subscription, authoritative, desktop, maintenance, virtualization product, revenue, service, related, primarily, asset, cost
Topic 45
covered, retrospective, litigation, responsibility, escrow, settlement, interchange sponsoring, covered, misstatements, responsibility, retrospective, escrow, non-controll covered, sponsoring, responsibility, litigation, escrow, interchange, retrospective litigation, covered, retrospective, note, responsibility, settlement, provision
Topic 46
transition, provisional, quantitative, pandemic, adoption, distinct, form stockholders, coronavirus, unsatisfied, pandemic, shutdown, non-distributor, capable stockholders, provisional, pandemic, distinct, perpetual, transition, enactment income, change, obligation, amount, rate, time, transition
Topic 47
company, report, presentation, management, statement, event, assumption supervision, company, presentation, reclassification, principle, public, report supervision, company, translation, report, indefinite, audit, public company, management, statement, report, future, acquisition, accounting
Topic 48
notebook, architecture, processor, workstation, game, marketable, warranty builder, custodian, motherboard, multi-core, municipality, navigation, recall inventory, notebook, marketable, rebate, processor, graphics, visual product, revenue, income, market, cost, related, expense
Topic 49
processing, senior, institution, unconsolidated, subscriber, banking, electronic acumen, distinction, sizable, accompany, convey, mild, perception client, processing, transit, unconsolidated, subscriber, banking, thrift revenue, service, rate, income, payment, business, expense
Topic 50
gigabit, dram, memory, venture, flash, production, joint underutilized, severe, tech, multi-chip, density, verdict, successively gigabit, underutilized, dram, memory, flash, wafer, tech product, cost, primarily, result, average, expense, acquisition
Topic 51
assembly, microcontroller, distributor, signal, fabrication, wafer, semiconductor uninsured, gate, virus, dispersion, expirations, wider, adopt distributor, microcontroller, uninsured, wafer, assembly, fabrication, semiconductor product, rate, acquisition, distributor, customer, facility, result
Topic 52
redemption, loyalty, conduit, deposit, mile, consent, offs unredeemed, prevailing, expiry, eliminations, restitution, non-executive, regression mile, unredeemed, loyalty, conduit, reward, expiry, redemption credit, increase, rate, expense, revenue, asset, program
Topic 53
gigabit, dram, supply, flash, memory, output, production width, creditor, gigabits, gigabit, successively, yuan, wind gigabit, dram, memory, wafer, flash, fabrication, width product, cost, result, primarily, agreement, amount, average
Topic 54
comparable, debenture, broadcast, family, shipment, distributor, mainstream foreign-currency, withhold, prom, lengthy, published, salable, wireline debenture, distributor, broadcast, shipment, inventory, wireless, mainstream revenue, product, income, period, rate, market, increase
Topic 55
divestiture, unallocated, identity, protection, enterprise, billing, transition writing, divestiture, identity, unallocated, non-operat, exceptions, varying divestiture, writing, identity, unallocated, billing, protection, metric revenue, primarily, income, expense, result, cost, operation
library(broom) # for tidy() and glance() model summaries

# Document-level topic proportions (theta) from the unsupervised stm model
tidy_theta <- as.data.frame(auto_stm_model$theta)
colnames(tidy_theta) <- paste0("topic_", 1:55)
tidy_theta <- cbind(out$meta, tidy_theta)

# Regress next-day cumulative abnormal returns on topic proportions with
# company (cik) fixed effects. The 55 proportions sum to one, so lm() drops
# one of them (topic_55 appears as NA in the coefficient table below).
topics <- paste0("topic_", 1:55)
iv_names <- paste(c(topics, "cik"), collapse = " + ")
lm_model <- lm(paste("cum_abnormal_return_next_day ~", iv_names), tidy_theta)

tidy(lm_model) %>% kable() %>%
  kable_styling(position = "center")
term estimate std.error statistic p.value
(Intercept) -0.0342238 0.0174691 -1.9591011 0.0512492
topic_1 -0.1601652 0.9372345 -0.1708913 0.8644521
topic_2 0.0161565 0.0270823 0.5965688 0.5513530
topic_3 0.0283555 0.0201120 1.4098833 0.1598580
topic_4 0.1070051 0.1016611 1.0525676 0.2935890
topic_5 0.0349806 0.0230135 1.5200022 0.1298158
topic_6 0.0164919 0.0216484 0.7618089 0.4469159
topic_7 0.0260029 0.0236328 1.1002862 0.2723007
topic_8 0.0494177 0.0206808 2.3895485 0.0176355
topic_9 0.0255111 0.0182664 1.3966191 0.1638077
topic_10 0.0571828 0.0258001 2.2163796 0.0275961
topic_11 0.0660352 0.0213440 3.0938550 0.0022079
topic_12 0.0261464 0.0176575 1.4807478 0.1399745
topic_13 -0.6844999 1.1688602 -0.5856132 0.5586812
topic_14 0.0236586 0.0208292 1.1358359 0.2571489
topic_15 0.1090007 0.0244369 4.4605055 0.0000125
topic_16 0.0316088 0.0186996 1.6903499 0.0922486
topic_17 0.0282858 0.0173990 1.6257179 0.1053112
topic_18 0.0365042 0.0174952 2.0865307 0.0379777
topic_19 -0.0001363 0.0226825 -0.0060069 0.9952121
topic_20 0.0376618 0.0174977 2.1523796 0.0323567
topic_21 0.0335991 0.0217613 1.5439838 0.1238989
topic_22 0.0380281 0.0173324 2.1940433 0.0291836
topic_23 0.0168625 0.0234769 0.7182595 0.4732902
topic_24 0.0348916 0.0169743 2.0555549 0.0408986
topic_25 0.0273443 0.0253257 1.0797059 0.2813481
topic_26 0.0312590 0.0176186 1.7742063 0.0772863
topic_27 0.0209410 0.0184292 1.1362966 0.2569564
topic_28 0.0677562 0.0222270 3.0483765 0.0025563
topic_29 0.0219950 0.0204805 1.0739480 0.2839157
topic_30 0.0345708 0.0176023 1.9639920 0.0506757
topic_31 -0.0021999 0.0335033 -0.0656622 0.9477010
topic_32 0.0047848 0.1118821 0.0427665 0.9659229
topic_33 0.0281372 0.0189613 1.4839287 0.1391291
topic_34 0.0319834 0.0173598 1.8423853 0.0666421
topic_35 0.0279150 0.0171583 1.6269065 0.1050583
topic_36 0.0127701 0.0201107 0.6349897 0.5260350
topic_37 0.0346392 0.0207871 1.6663786 0.0969318
topic_38 0.0388209 0.0189901 2.0442700 0.0420094
topic_39 0.0365428 0.0265403 1.3768774 0.1698225
topic_40 0.0172476 0.0261905 0.6585459 0.5108135
topic_41 0.0377398 0.0203535 1.8542170 0.0649248
topic_42 0.0248161 0.0180613 1.3739901 0.1707160
topic_43 0.0331811 0.0189071 1.7549500 0.0805333
topic_44 0.0364368 0.0178258 2.0440551 0.0420308
topic_45 0.0994105 0.1427843 0.6962284 0.4869539
topic_46 0.0250350 0.0417817 0.5991848 0.5496102
topic_47 0.0684657 0.2844716 0.2406769 0.8100093
topic_48 0.0459670 0.0179390 2.5624016 0.0110015
topic_49 0.0281820 0.0176058 1.6007185 0.1107439
topic_50 0.0133341 0.0245449 0.5432550 0.5874543
topic_51 0.0768139 0.0237003 3.2410519 0.0013582
topic_52 0.0497146 0.0219592 2.2639504 0.0244631
topic_53 0.0307229 0.0228948 1.3419172 0.1808806
topic_54 0.0345447 0.0173587 1.9900518 0.0477105
topic_55 NA NA NA NA
cik4127 0.0050528 0.0125512 0.4025697 0.6876202
cik6281 0.0047687 0.0124193 0.3839731 0.7013355
cik723125 0.0074893 0.0122164 0.6130522 0.5404175
cik723531 0.0044749 0.0134251 0.3333207 0.7391808
cik743316 0.0028395 0.0122850 0.2311381 0.8174028
cik743988 0.0033442 0.0124527 0.2685524 0.7885029
cik769397 0.0181427 0.0121326 1.4953623 0.1361228
cik779152 0.0058436 0.0122335 0.4776735 0.6333138
cik796343 0.0007463 0.0121797 0.0612720 0.9511932
cik798354 0.0107927 0.0120591 0.8949898 0.3716818
cik804328 -0.0082341 0.0121683 -0.6766851 0.4992520
cik813672 0.0012690 0.0121904 0.1041005 0.9171758
cik827054 0.0060625 0.0123319 0.4916091 0.6234413
cik849399 0.0038317 0.0125188 0.3060772 0.7598090
cik877890 -0.0045308 0.0126916 -0.3569895 0.7214108
cik883241 0.0046864 0.0123963 0.3780467 0.7057273
cik896878 0.0062140 0.0123736 0.5022019 0.6159821
cik1013462 -0.0010103 0.0125133 -0.0807342 0.9357201
cik1045810 -0.0091426 0.0123840 -0.7382571 0.4610735
cik1101215 -0.0098667 0.0122508 -0.8053917 0.4213843
cik1108524 0.0092281 0.0121335 0.7605490 0.4476668
cik1123360 -0.0153089 0.0125179 -1.2229583 0.2225350
cik1136893 0.0044118 0.0124185 0.3552601 0.7227042
cik1141391 -0.0040708 0.0120817 -0.3369387 0.7364551
cik1175454 0.0057457 0.0126414 0.4545141 0.6498662
cik1341439 0.0153888 0.0126262 1.2187986 0.2241074
cik1365135 0.0063853 0.0124522 0.5127884 0.6085671
cik1383312 0.0302267 0.0128217 2.3574562 0.0191976
cik1403161 -0.0006274 0.0123547 -0.0507804 0.9595424
glance(lm_model) %>% kable() %>% 
  kable_styling(position = "center")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.276611 0.0285065 0.0265696 1.114897 0.2616751 83 768.7174 -1367.435 -1045.549 0.1708383 242 326

Results, Limitations and Conclusion

The management discussion section was successfully extracted from the downloaded 10-K reports for the 30 companies. The resulting corpus consists of 326 reports out of a possible 330, spanning the years 2010-2020. Reports were cleaned using rvest and regular expressions. Parsing errors were removed with the hunspell package together with frequency filtering and token-length filtering. Stopwords were removed using a number of dictionaries, including finance-specific ones. The text was then POS-tagged, and only nouns, adverbs and adjectives were kept for further analysis.
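
As a rough illustration of these cleaning steps, a minimal sketch is given below. The object item7_text (one row per report with columns cik, year and text), the udpipe model file name and the filtering thresholds are assumptions for illustration, not the original code.

library(dplyr)
library(tidytext)
library(hunspell)
library(udpipe)

# Assumed input: item7_text has one row per report (columns cik, year, text)
tokens <- item7_text %>%
  unnest_tokens(word, text) %>%                    # tokenise into lower-case words
  filter(hunspell_check(word),                     # drop parsing errors / non-words
         nchar(word) >= 3, nchar(word) <= 20) %>%  # token-length filtering
  anti_join(get_stopwords(), by = "word")          # generic stopword removal

# Frequency filtering: keep terms appearing at least 5 times in the corpus
keep_terms <- tokens %>% count(word) %>% filter(n >= 5) %>% pull(word)
tokens     <- tokens %>% filter(word %in% keep_terms)

# POS tagging with udpipe; keep only nouns, adjectives and adverbs
ud_model  <- udpipe_load_model("english-ewt-ud-2.5-191206.udpipe")  # assumed local model file
annotated <- as.data.frame(udpipe_annotate(
  ud_model, x = item7_text$text,
  doc_id = paste(item7_text$cik, item7_text$year, sep = "_")))
annotated <- annotated %>% filter(upos %in% c("NOUN", "ADJ", "ADV"))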

Event study methodology was used to calculate abnormal returns, abnormal volume and the variance of abnormal returns. Cumulative abnormal returns, average abnormal returns, cumulative abnormal volume and average abnormal volume over two-day and five-day windows were used as target variables, together with the variance of abnormal returns over the five-day window. To create a baseline model, fundamental and technical indicators were used. To test whether sentiment has an effect on prices and returns, multiple finance-specific and non-finance-specific dictionaries were evaluated, and the change in sentiment relative to the previous year's report was also tested as a predictor. In total 572 models were fitted.
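
For reference, the core abnormal-return calculation follows the market-model event-study approach of MacKinlay (1997). The sketch below is illustrative only: stock_ret and market_ret (aligned daily return series) and the window indices est_idx and evt_idx are assumed names, not objects from the original code.

# Estimate the market model over the estimation window
market_model <- lm(stock_ret[est_idx] ~ market_ret[est_idx])

# Abnormal returns over the event window: actual minus model-implied returns
abnormal_ret <- stock_ret[evt_idx] -
  (coef(market_model)[1] + coef(market_model)[2] * market_ret[evt_idx])

car    <- sum(abnormal_ret)   # cumulative abnormal return (2- or 5-day window)
aar    <- mean(abnormal_ret)  # average abnormal return
var_ar <- var(abnormal_ret)   # variance of abnormal returns (5-day window)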

Although sentiment scores appeared to be statistically significant on a number of occasions and the adjusted R-squared improved, the results were not consistent: p-values were generally large and the sentiment coefficients often switched signs. For most of the dependent variables the best-performing models were the baseline models built from fundamental and technical indicators, and the inconsistency became even more apparent after cross-validation. Among the sentiment-based models, the best performers on cross-validation were those that combined the sentimentr algorithm with the Loughran-McDonald (LM) dictionary. This is encouraging because it is consistent with theory: a finance-specific dictionary combined with an algorithm that accounts for negation should produce the most accurate scores. However, despite this cross-validation performance, the sentimentr-LM scores are not statistically significant predictors of abnormal returns, and even the “custom” dictionary fitted with returns as the response variable performs poorly on cross-validation. There is therefore little evidence that sentiment in the management section of 10-K reports influences prices, and even less that it can support decisions such as building trading strategies. Nonetheless, this study uses only 326 reports from 30 companies; including all major companies over a longer time period might yield different results. Furthermore, including other sections of the reports, such as Item 1 “Business” and Item 1A “Risk Factors”, may be a fruitful avenue for future research, since the additional text would allow the algorithms to capture a larger proportion of the sentiment expressed.
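
To make the sentimentr-LM combination concrete, a minimal sketch is given below. It assumes the sentimentr package is installed (it is not in the library list above) and that item7_text holds one row per report with a text column; the Loughran-McDonald polarity table is the one shipped with the lexicon package.

library(sentimentr)

# Sentence-level sentiment with valence shifters (negation, amplifiers),
# scored against the Loughran-McDonald finance dictionary
lm_scores <- sentiment_by(
  get_sentences(item7_text$text),
  polarity_dt = lexicon::hash_sentiment_loughran_mcdonald)

# Average sentiment per report, to be joined back onto the report-level data
# and added to the baseline regression as a predictor of abnormal returns
item7_text$sentimentr_lm <- lm_scores$ave_sentiment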

Topic modelling was conducted using both a supervised and an unsupervised stm approach. To select K, values from 4 to 60 were explored; based on semantic coherence the optimal K was chosen to be 9. Company fundamentals and industry features, such as the P/E ratio and the debt-to-equity ratio, were used as prevalence covariates. The unsupervised approach resulted in 55 topics. Unfortunately, neither approach yielded topics with a statistically significant relationship to cumulative abnormal returns: in the 55-topic regression above, for example, the adjusted R-squared is only about 0.03 and the overall F-test is not significant (p ≈ 0.26).
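
The sketch below outlines how the K search and the two stm fits described above could be set up. It assumes out is the output of stm::prepDocuments() (as in the regression code earlier) and that the prevalence covariates live in out$meta under illustrative names such as pe_ratio and debt_to_equity.

# Explore K = 4..60 and inspect semantic coherence (and exclusivity)
k_search <- searchK(out$documents, out$vocab, K = seq(4, 60, by = 2),
                    prevalence = ~ pe_ratio + debt_to_equity, data = out$meta)
plot(k_search)

# Supervised choice: K = 9 selected from the semantic coherence curve
stm_model <- stm(out$documents, out$vocab, K = 9,
                 prevalence = ~ pe_ratio + debt_to_equity, data = out$meta)

# Unsupervised route: K = 0 lets stm choose the number of topics itself,
# which is how a 55-topic model such as auto_stm_model could be obtained
auto_stm_model <- stm(out$documents, out$vocab, K = 0,
                      prevalence = ~ pe_ratio + debt_to_equity, data = out$meta)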


Bibliography

Feldman, R., Govindaraj, S., Livnat, J., & Segal, B. (2010). Management’s tone change, post earnings announcement drift and accruals. Review of Accounting Studies, 15(4), 915-953.

MacKinlay, A. C. (1997). Event studies in economics and finance. Journal of Economic Literature, 35(1), 13-39.

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).

Mourtgos, S. M., & Adams, I. T. (2019). The rhetoric of de-policing: Evaluating open-ended survey responses from police officers with machine learning-based structural topic modeling. Journal of Criminal Justice.

Stewart BM (2020). Comment on GitHub issue #212, “Non-Atomic Vectors as Metadata?” [Online]. Comment posted on 9 Feb 2020. Available from: https://github.com/bstewart/stm/issues/212

Roberts ME, Stewart BM, Tingley D, Lucas C, Leder-Luis J, Gadarian SK, Albertson B, Rand DG (2014). “Structural Topic Models for Open-Ended Survey Responses.” American Journal of Political Science, 58(4), 1064–1082. doi:10.1111/ajps.12103.

Yadav, P. K. (1992). Event studies based on volatility of returns and trading volume: A review. The British Accounting Review, 24(2), 157-184.

Yan, Q., 2020. Notes for “Text Mining with R: A Tidy Approach”. [ebook] Available at: https://bookdown.org/Maxine/tidy-text-mining/.

Package References

Hadley Wickham (2020). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.6. https://CRAN.R-project.org/package=rvest

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/

Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS, 1(3). doi: 10.21105/joss.00037 (URL: https://doi.org/10.21105/joss.00037).

Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr

Stefan Milton Bache and Hadley Wickham (2020). magrittr: A Forward-Pipe Operator for R. R package version 2.0.1. https://CRAN.R-project.org/package=magrittr

Hadley Wickham (2020). httr: Tools for Working with URLs and HTTP. R package version 1.4.2. https://CRAN.R-project.org/package=httr

Mario Annau (2015). tm.plugin.webmining: Retrieve Structured, Textual Data from Various Web Sources. R package version 1.3. https://CRAN.R-project.org/package=tm.plugin.webmining

Microsoft and Steve Weston (2020). foreach: Provides Foreach Looping Construct. R package version 1.5.1. https://CRAN.R-project.org/package=foreach

Microsoft Corporation and Steve Weston (2020). doParallel: Foreach Parallel Adaptor for the ‘parallel’ Package. R package version 1.0.16. https://CRAN.R-project.org/package=doParallel

Rinker, T. W. (2018). lexicon: Lexicon Data version 1.2.1. http://github.com/trinker/lexicon

Jeffrey A. Ryan and Joshua M. Ulrich (2020). quantmod: Quantitative Financial Modelling Framework. R package version 0.4.18. https://CRAN.R-project.org/package=quantmod

Jan Wijffels (2020). udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ‘UDPipe’ ‘NLP’ Toolkit. R package version 0.8.5. https://CRAN.R-project.org/package=udpipe

Wilson Freitas (2021). bizdays: Business Days Calculations and Utilities. R package version 1.0.8. https://CRAN.R-project.org/package=bizdays

Alboukadel Kassambara (2019). ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’. R package version 0.1.3. https://CRAN.R-project.org/package=ggcorrplot

Joshua Ulrich (2020). TTR: Technical Trading Rules. R package version 0.24.2. https://CRAN.R-project.org/package=TTR

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Roberts ME, Stewart BM, Tingley D (2019). “stm: An R Package for Structural Topic Models.” Journal of Statistical Software, 91(2), 1-40. doi: 10.18637/jss.v091.i02 (URL: https://doi.org/10.18637/jss.v091.i02)

Ian Fellows (2018). wordcloud: Word Clouds. R package version 2.6. https://CRAN.R-project.org/package=wordcloud

Nicolas Proellochs and Stefan Feuerriegel (2021). SentimentAnalysis: Dictionary-Based Sentiment Analysis. R package version 1.3-4. https://CRAN.R-project.org/package=SentimentAnalysis

Posthuma Partners (2019). lmvar: Linear Regression with Non-Constant Variances. R package version 1.5.2. https://CRAN.R-project.org/package=lmvar

Jeroen Ooms (2021). magick: Advanced Graphics and Image-Processing in R. R package version 2.7.2. https://CRAN.R-project.org/package=magick

Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes

Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra


Appendix