Drafted bayes report

2026-04-11 10:07:12 -05:00 · 2024-10-16 15:50:07 -04:00
parent 6473765329
commit f0eb89cd4f
6 changed files with 215 additions and 15 deletions
--- a/report/report.pdf
+++ b/report/report.pdf
--- a/report/report.tex
+++ b/report/report.tex
@@ -4,6 +4,8 @@
 \usepackage{amsmath}
 \usepackage{amssymb}
 \usepackage[a4paper, total={6in, 10in}]{geometry}
 \usepackage{setspace}
 \setstretch{1.25}
 \hyphenpenalty 1000
 \begin{document}
@@ -33,7 +35,24 @@
 \newpage
 % Begin report
 \section{Objective}
-yada yada yah I started this independent study for my own selfish gain
+The educational focus of Implementations of Probability Theory centers on the application of data
 models that produce non-deterministic insights through probabilistic methodology. By pursuing this
 study I hope to gain a deeper understanding of how to apply data in risk calculation for mitigation
 scenarios as they appear in real life, rather than the experimental lab conditions that enable algorithmic
 certainty.
 In contrast to the path of black-box artificial intelligence and algorithms taught in \textbf{CSCI 335: Machine Learning}, this study is tailored to methods
 designed to produce confidence levels for uncertain events using certain terms, leveraging logical,
 traceable, and definite calculations. Current course offerings in the realm of data science focus largely on
 the storing and management of data, and it is noted that the cluster of data science was until very recently
 under the branding of data management. Implementations of Probability Theory is intended to extend
 learnings in previous courses, notably \textbf{CSCI 420: Principles of Data Mining}, for more advanced algorithms
 used at the intersection of data and computing after the preprocessing stage.
 After beginning this study the intended deliverable outline was determined to be technically implausible and has been replaced with 
 demonstrations of applied algorithms.  Taking inspiration from the retinal mosaic as displayed in \textbf{CSCI 431: Intro to Computer Vision} 
 and discussion in \textbf{IGME 589: Computational Creativity and Algorithmic Art} on the appearance and nature of randomness in graphics, I hope to create 
 a program that can determine the likelihood that randomly distributed colors on a hexagonal grid appear as they do in an image.
 \newpage
 \section{Units}
@@ -155,4 +174,139 @@ To calculate standard error, kys.
 Statistical Inference is any data analysis that draws conclusions from a sample in order to make assertions about the population.  
 Methods include estimation via averages and confidence intervals, and hypothesis testing, which attempts to invalidate (never \textit{validate}) a hypothesis.
 \newpage
 \subsection{Unit 2: Probabilistic Theories and Epistemology}
 When developing probabilistic models it is vital to use domain expertise to expose the product to the full range of external variables that would be expected 
 of a model applied to the real world.  Without an appropriate understanding of both the limitations in research procedures and the true value of the data collected, 
 the integrity of the model becomes inherently compromised.  
 As data scientists, we are uniquely at risk of falling for this trap because it is hard to fully grasp domain expertise when the nature of data science 
 in a business setting frequently means consulting for many separate projects with a collectively massive scope.  Of equal consideration, it is also easy 
 to assume that the sophistication of our tools overrides imperfections in the data, in spite of mantras like `Garbage In, Garbage Out'.
 In this unit I explored some common fallacies and assumptions held by analysts who may not fully grasp the content that they work with, 
 nor the problems they intend to solve.  This required extensive research that I found was best digested in the form of books whose chapters chronicle multiple 
 examples of a given principle.  As such, the reading was not confined to just the timeslot designated for this unit.  Research started during the months leading up 
 to the start of the semester\footnote{Only research during the semester was logged in the timesheet.} and has continued through the independent study.  This structure was particularly helpful for stepping back and regaining perspective on what 
 my goal was when I was knee-deep in feature construction and model formulation.
 \subsubsection{Moral Hazards and The Bob Rubin Trade}
 Picking pennies in front of a steamroller.
 When studying the effectiveness of a model, the scope of review must capture the entire range of the sample space.  Discarding black swans that do not directly impact 
 the client does not shield the client from the consequences of that oversight.  There is therefore a question of whether data scientists are obligated to include 
 flags for significant real-world events even when they do not affect the course of action proposed to the client.
 The Great Recession of 2007--2009, attributed to the collapse of the housing market bubble, is the most common example of a moral hazard.  Risk was displaced from 
 banks, which were federally required to give subprime loans, onto the taxpayer, so banks could profit from subprime loans without being harmed when the inevitable 
 occurred.  In popular media, the bursting of the housing bubble is attributed to the banks, while those in the industry passed off the event as something that nobody could 
 have foreseen.\footnote{For instance, in the 2015 movie \textit{The Big Short}, only a few savvy traders who bothered to look into the details find that banks had, 
 in their ignorance, built the bundled mortgages on an unstable foundation.}  In reality, banks ignored a probabilistic eventuality only because their models did not 
 need to account for such an event.
 Most discussions emphasize the problems of risk transference when creating models.  For this study's purposes, the important lesson is that probabilistic models should not 
 stop evaluating an event as soon as it leaves the scope of the immediate client.
 \subsubsection{Ignoring Improbable Outliers with Outsized Impact}
 In machine learning it is common for algorithms to drop the most extreme (or a random selection of) datapoints to avoid overfitting and errors in data collection.  
 One issue with the current implementation of this procedure is that it is often done blindly, ignorant of information that these outliers may relay.  For instance, 
 in a selection of 300 water samples from a stream, all but a few show a normal amount of oxygen in the stream.  A citizen scientist may discount the remaining pockets 
 as a statistical implausibility that is most likely indicative of a failure in sample testing and drop the most extreme 5\% of datapoints.  
 However, if these few pockets show a complete disruption of the dissolution process, the vast majority of aquatic life in the stream will eventually pass through 
 these pockets without oxygen and die, resulting in an outsized impact from just a few sources.
 Nassim Taleb in \textit{Fooled By Randomness} describes this event with an analogy to Russian Roulette: If there were a 5/6 chance of winning a million dollars and a 
 1/6 chance of killing yourself, many people would at least hesitate before pulling the trigger.  But what if the cylinder held 10,000 chambers and there was only a 
 1/10,000 chance of harm?  In this case, many less-than-rational actors play the game repeatedly to accumulate wealth, forgetting or even outright ignorant 
 that eventually the unlikely, or, as the actor would see it, the unthinkable, happens and all of the gains are completely negated.
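 The cost of repeated play can be made concrete with a short calculation.  As a sketch, assume each pull is independent with a 1/10,000 chance of harm (independence and the symbol \(n\) for the number of plays are assumptions made here for illustration):
 \[
 P(\text{no harm in } n \text{ plays}) = \left(1 - \frac{1}{10{,}000}\right)^{n}, \qquad
 \left(1 - \frac{1}{10{,}000}\right)^{10{,}000} \approx e^{-1} \approx 0.37
 \]
 Under these assumptions, an actor who plays 10,000 times has roughly a 63\% chance of having met the unthinkable at least once.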
 \subsubsection{Fooled By Randomness}
 May justify its own subsection since the others acknowledge small probabilities whereas this is outright randomness.
 \subsubsection{Lindy Effect}
 "For the perishable, every additional day in its life translates into a shorter additional life expectancy. 
 For the nonperishable, every additional day may imply a longer life expectancy."
 A proven tool is more likely to stand the test of time than the new, unproven tool meant to replace it.  
 "The robustness of an item is proportional to its life!"
 "Inaccurate science\ldots is constantly being published. The Lindy-conscious consumer of scientific data will take seriously only 
 information that has held up over a period of time."\footnote{\url{https://www.nytimes.com/2021/06/17/style/lindy.html}} 
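 One common way to make the nonperishable case precise, offered here as an assumption rather than something the sources above state, is to model the lifetime \(T\) of an item with a Pareto distribution with minimum \(x_m\) and tail index \(\alpha > 1\).  The expected remaining life after surviving to age \(t \geq x_m\) is then
 \[
 E[\,T - t \mid T > t\,] = \frac{t}{\alpha - 1}
 \]
 which grows in direct proportion to the age already attained.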
 \subsubsection{Decision Theory}
 Decision theory is the study of how people make decisions with uncertain information.  There are two main branches of decision theory:
 \subsubsection*{Normative/Rational Decision Theory}
 This branch studies how people \textit{should} make decisions.  In problems with other actors, as in game theory, it is assumed that all other actors will also 
 act with perfect rationality, allowing for precise calculation of the actions of all of the others and their expected utility to the agent.
 \subsubsection*{Descriptive Decision Theory}
 This branch studies how people actually make decisions which includes factors such as psychological and emotional biases.
 \subsubsection{Info Gap Decisions}
 In info gap decision theory there is not enough information to assign probabilities to events and the goal is to select a course of action that is robust in the 
 face of uncertainty.  Where decision theory can predict expectations in irrationality to determine expected values, info gap decisions approximate the range of 
 probabilities and weight them to estimate expected value.  In essence, it applies probabilities to probabilities, adding an additional layer to insulate calculations 
 from a lack of data or lack of understanding of a topic.
 \subsubsection{Methodology Considerations}
 Given that I have taken 10134023 observations over the last 40 years, in all of which Obama has been alive, I can say with a high degree of certainty that Obama is immortal.
 An event never having occurred in history does not discount the possibility of it occurring in the future.  Similarly, events that may have been impossible in the past 
 are not necessarily impossible in the future.
 Psychology also matters: someone who knows they are being studied will act differently than someone who does not, so models built on such observations may be inaccurate.
 \newpage
 \subsection{Unit 3: Bayesian Statistics}
 This unit was deliberately separated from statistical review due to the perceived complexity of the topic and the magnitude of its usage in recent data science 
 breakthroughs.  Bayes Theorem is a part of the curriculum for both \textbf{MATH 351 - Probability and Statistics} and \textbf{CSCI 420 - Principles of Data Mining}.  
 However, as the two courses approached the topic from different perspectives and neither solidified my personal confidence in its use, I chose to take extra time to learn 
 this important topic in my own way.  
 It has been said that statistics does not come naturally to the human brain, hence statistics is, by mathematical standards, a 
 young discipline.  My research on Bayesian statistics has led me to the conclusion that the opposite may be true: Bayes Theorem is quite intuitive, but 
 its discipline has not had the time to crystallize best practices for teaching it.  For instance, updating one's beliefs by comparing probabilities with the 
 number of documented occurrences is frequently used in philosophical discussion, in the form of arguments that a subset with a high likelihood of fulfilling some terms 
 is a valid classification even when the subset is so small that most cases fulfilling those terms fall outside of it.  Most people understand 
 these expressions but, when shown a table and how to calculate those ratios, the content enters the realm of collegiate instruction.
 \subsubsection{Bayes Theorem}
 The equation for Bayes Theorem is as follows:
 \[
 P(A|E) = \frac{P(A) \cdot P(E|A)}{P(A) \cdot P(E|A) + (1 - P(A)) \cdot P(E|\neg A)}
 \]
 This formula appears more complex than it is.  The denominator, while directly translating to ``the probability of A times the probability of event E occurring in A 
 plus the probability of not A times the probability of E occurring in not A,'' 
 can be expressed more simply as \(P(E)\), the probability of event E occurring.  
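 As a worked illustration with purely hypothetical numbers (assumed here, not taken from any dataset), let \(P(A) = 0.01\), \(P(E|A) = 0.9\), and \(P(E|\neg A) = 0.05\).  The denominator is then
 \[
 P(E) = 0.01 \cdot 0.9 + 0.99 \cdot 0.05 = 0.009 + 0.0495 = 0.0585
 \]
 and the posterior follows as
 \[
 P(A|E) = \frac{0.009}{0.0585} \approx 0.154
 \]
 Even though E is far more likely under A than under not A, the small prior keeps the posterior near 15\%.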
 By utilizing vernacular more familiar to everyday life, Bayes Theorem can be translated into:
 \[
 \text{P(occurrence came from category)} = \frac{\text{\# of occurrences from category}}{\text{total \# of occurrences}}
 \]
 Finally, this equation is updated to replace descriptions with technical terms:
 \[
 \text{posterior probability} = \frac{\text{prior} \cdot \text{likelihood}}{\text{evidence}}
 \]
 Even this equation can be misconstrued as a number of arrangements of ratios involving total occurrences from a category or non-occurrences from outside 
 of the category so as a final demonstration, the sample space will be visualized geometrically
 \footnote{Concept credit to 3Blue1Brown on Youtube, this video is what finally clarified in my mind what the equation behind Bayes Theorem meant.\\
 \url{https://www.youtube.com/watch?v=HZGCoVF3YvM}} as a 1 unit by 1 unit square.
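 One way to read that square, sketched here under the assumption that the horizontal axis is split by the prior and each column is then split vertically by its likelihood: the column of width \(P(A)\) contains a shaded rectangle of height \(P(E|A)\), and the remaining column of width \(1 - P(A)\) contains a shaded rectangle of height \(P(E|\neg A)\).  The two shaded areas are
 \[
 \text{area}(A \cap E) = P(A) \cdot P(E|A), \qquad \text{area}(\neg A \cap E) = (1 - P(A)) \cdot P(E|\neg A)
 \]
 and the posterior is the share of the total shaded area that lies in the first column:
 \[
 P(A|E) = \frac{\text{area}(A \cap E)}{\text{area}(A \cap E) + \text{area}(\neg A \cap E)}
 \]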
 \subsubsection{Bayesian Updating}
 Bayesian Updating is another term that has been added to buzzword vocabulary.  It describes the repeated application of Bayes Theorem as new evidence arrives, a process that appears 
 to have been popularized through the study of applied Bayes Theorem.  In essence, Bayesian Updating states that observed occurrences should not 
 override previous evidence; they should instead be combined with it, with the previous posterior serving as the new prior (weighting each observation equally being a naive assumption).  This evidence updating makes 
 applications of Bayes Theorem calculate posterior probabilities continuously as new information enters the system, rather than in a calculation that is only done once.
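 The process can be sketched as a recurrence.  Assuming, for illustration, that observations \(E_1, E_2, \ldots\) are conditionally independent given A and given not A (the subscripted symbols are introduced here and are not part of the original notation), and writing \(p_n = P(A|E_1, \ldots, E_n)\) with \(p_0 = P(A)\), each posterior becomes the prior for the next observation:
 \[
 p_{n+1} = \frac{p_n \cdot P(E_{n+1}|A)}{p_n \cdot P(E_{n+1}|A) + (1 - p_n) \cdot P(E_{n+1}|\neg A)}
 \]
 With \(n = 0\) this is exactly the single-observation form of Bayes Theorem.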
 \subsubsection{Bayesian Belief Networks}
 Bayesian Belief Networks are probabilistic graphical models that represent conditional dependence between random variables.  In spite of their name, 
 Bayesian Belief Networks do not necessarily apply Bayesian models, though they are a way to utilize Bayes Theorem for domains with greater complexity than a 
 single posterior probability.  In this type of network, edges are directed, so each dependence is read in a single direction.  This is in contrast to undirected
 models such as Markov random fields, whose edges do not assume an order of acquisition of random variables.
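 The directed structure is what allows the joint distribution to be factored.  As a sketch, writing \(X_1, \ldots, X_n\) for the random variables and \(\mathrm{pa}(X_i)\) for the parents of \(X_i\) in the graph (notation assumed here for illustration):
 \[
 P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i))
 \]
 For a three-node chain \(A \rightarrow B \rightarrow C\), this reduces to \(P(A, B, C) = P(A) \cdot P(B|A) \cdot P(C|B)\).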
 \end{document}
--- a/timesheet/reportUpdating.py
+++ b/timesheet/reportUpdating.py
@@ -36,7 +36,7 @@ def csv2Table(inFile):
        rows = list(reader)
    out = "\\begin{table}[h!]\n\\centering\n"
-    out += "\\begin{tabular}[t]{|" + " c | c | c | c | p{6cm} |}\n"
+    out += "\\begin{tabular}[t]{|" + " c | p{1.3cm} | c | c | p{6cm} |}\n"
    out += "\\hline\n"
    for row in rows:
--- a/timesheet/timesheet.csv
+++ b/timesheet/timesheet.csv
@@ -1,10 +1,25 @@
 Week,Date,Type,Duration (Hours),Description
 1,08/30,Advising Meetings,2,"Stat Review Content acknowledgement, Latex overview for reports"
 2,09/02,Reporting,3,"First applications of Latex for final report, created Timesheet System."
-2,09/02,Research,2,"Stat Review: Sample Space through Probability Density Functions"
+2,09/02,Research,2.5,"Stat Review: Sample Space through Probability Density Functions"
 2,09/06,Advising Meetings,1,"Research Review and exploration of PDF expected values and confidence intervals"
 3,09/14,Research,3,"Reading: Fooled by Randomness by Nassim N. Taleb"
 4,09/19,Research,2,"Producing Confidence Intervals"
-4,09/20,Research,1,"Statistical Inference and t-testing"
+4,09/20,Research,1.5,"Statistical Inference and t-testing"
 4,09/20,Advising Meetings,1,"Stat Review finalization, definition of reporting standard"
-5,09/23,Research,2,"Parametric and Non-parametric tests"
+5,09/23,Research,2.5,"Parametric and Non-parametric tests"
-6,10/03,Reporting,4,"Structuring stat review report"
+5,09/26,Research,3,"Kinsman's suggested reading: Prob and Stat by Charles Linn"
 6,09/25 - 09/30,Research,5,"Reading: Fooled by Randomness by Nassim N. Taleb"
 6,10/03,Reporting,4,"Structuring stat review writeup"
 6,10/04,Reporting,2,"Confidence Statistics writeup"
 6,10/04,Research,2.5,"Ludic Fallacy Reading: Skin in the Game by Nassim N. Taleb"
 6,10/04,Advising Meetings,1,"Report review and discussion on replacing deliverables"
 6,10/05,Application,1.5,"Hexagonal basis vectors"
 7,10/08,Research,2,"The Black Swan by Nassim Taleb"
 7,10/10,Reporting,2,"Epistemology Writeup"
 7,10/10,Research,1.5,"The Lindy Effect: The Lindy Way of Living - NYT"
 7,10/11,Reporting,3,"Moral Hazards, Outsized Impact, Lindy Effect in writeup"
 7,10/11,Advising Meetings,1,"Epistemology and Overview discussion, hex mapping"
 8,10/15,Research,3,"Bayes Belief Networks"
 8,10/16,Application,2.5,"Bayes visualizations and practice worksheets"
 8,10/16,Reporting,2,"Early Bayesian Statistics Report"
--- a/timesheet/timesheet.pdf
+++ b/timesheet/timesheet.pdf
--- a/timesheet/timesheet.tex
+++ b/timesheet/timesheet.tex
@@ -28,7 +28,7 @@
 % OPEN Timesheet
 \begin{table}[h!]
 \centering
-\begin{tabular}[t]{| c | c | c | c | p{6cm} |}
+\begin{tabular}[t]{| c | p{1.3cm} | c | c | p{6cm} |}
 \hline
 Week & Date & Type & Duration (Hours) & Description \\
 \hline
@@ -36,26 +36,57 @@ Week & Date & Type & Duration (Hours) & Description \\
 \hline
 2 & 09/02 & Reporting & 3 & First applications of Latex for final report, created Timesheet System. \\
 \hline
-2 & 09/02 & Research & 2 & Stat Review: Sample Space through Probability Density Functions \\
+2 & 09/02 & Research & 2.5 & Stat Review: Sample Space through Probability Density Functions \\
 \hline
 2 & 09/06 & Advising Meetings & 1 & Research Review and exploration of PDF expected values and confidence intervals \\
 \hline
 3 & 09/14 & Research & 3 & Reading: Fooled by Randomness by Nassim N. Taleb \\
 \hline
 4 & 09/19 & Research & 2 & Producing Confidence Intervals \\
 \hline
-4 & 09/20 & Research & 1 & Statistical Inference and t-testing \\
+4 & 09/20 & Research & 1.5 & Statistical Inference and t-testing \\
 \hline
 4 & 09/20 & Advising Meetings & 1 & Stat Review finalization, definition of reporting standard \\
 \hline
-5 & 09/23 & Research & 2 & Parametric and Non-parametric tests \\
+5 & 09/23 & Research & 2.5 & Parametric and Non-parametric tests \\
 \hline
-6 & 10/03 & Reporting & 4 & Structuring stat review report \\
+5 & 09/26 & Research & 3 & Kinsman's suggested reading: Prob and Stat by Charles Linn \\
 \hline
 6 & 09/25 - 09/30 & Research & 5 & Reading: Fooled by Randomness by Nassim N. Taleb \\
 \hline
 6 & 10/03 & Reporting & 4 & Structuring stat review writeup \\
 \hline
 6 & 10/04 & Reporting & 2 & Confidence Statistics writeup \\
 \hline
 6 & 10/04 & Research & 2.5 & Ludic Fallacy Reading: Skin in the Game by Nassim N. Taleb \\
 \hline
 6 & 10/04 & Advising Meetings & 1 & Report review and discussion on replacing deliverables \\
 \hline
 6 & 10/05 & Application & 1.5 & Hexagonal basis vectors \\
 \hline
 7 & 10/08 & Research & 2 & The Black Swan by Nassim Taleb \\
 \hline
 7 & 10/10 & Reporting & 2 & Epistemology Writeup \\
 \hline
 7 & 10/10 & Research & 1.5 & The Lindy Effect: The Lindy Way of Living - NYT \\
 \hline
 7 & 10/11 & Reporting & 3 & Moral Hazards, Outsized Impact, Lindy Effect in writeup \\
 \hline
 7 & 10/11 & Advising Meetings & 1 & Epistemology and Overview discussion, hex mapping \\
 \hline
 8 & 10/15 & Research & 3 & Bayes Belief Networks \\
 \hline
 8 & 10/16 & Application & 2.5 & Bayes visualizations and practice worksheets \\
 \hline
 8 & 10/16 & Reporting & 2 & Early Bayesian Statistics Report \\
 \hline
 \end{tabular}
 \end{table}
-\noindent Hours for Advising Meetings: 4\\
+\noindent Hours for Advising Meetings: 6.0\\
-Hours for Reporting: 7\\
+Hours for Application: 4.0\\
-Hours for Research: 7\\
+Hours for Reporting: 16.0\\
-\textbf{Total Hours: 18}\\
+Hours for Research: 28.5\\
 \textbf{Total Hours: 54.5}\\
 % CLOSE Timesheet
 \end{document}