diff --git a/report/report.pdf b/report/report.pdf
index da2cf05..a8fc1ae 100644
Binary files a/report/report.pdf and b/report/report.pdf differ
diff --git a/report/report.tex b/report/report.tex
index 4bd5f91..48dd965 100644
--- a/report/report.tex
+++ b/report/report.tex
@@ -4,6 +4,8 @@
 \usepackage{amsmath}
 \usepackage{amssymb}
 \usepackage[a4paper, total={6in, 10in}]{geometry}
+\usepackage{setspace}
+\setstretch{1.25}
 \hyphenpenalty 1000
 \begin{document}
@@ -33,7 +35,24 @@
 \newpage
 % Begin report
 \section{Objective}
-yada yada yah I started this independent study for my own selfish gain
+The educational focus of Implementations of Probability Theory is the application of data
+models that produce non-deterministic insights through probabilistic methods. By pursuing this
+study I hope to gain a deeper understanding of how to apply data in risk calculation for mitigation
+scenarios as they appear in real life, rather than under the experimental lab conditions that enable algorithmic
+certainty.
+
+In contrast to the path of black-box artificial intelligence and algorithms taught in \textbf{CSCI 335: Machine Learning}, this study is tailored to methods
+designed to produce confidence levels for uncertain events using certain terms, leveraging logical,
+traceable, and definite calculations. Current course offerings in the realm of data science focus largely on
+the storing and management of data, and the data science cluster was, until very recently,
+branded as data management. Implementations of Probability Theory is intended to extend
+what was learned in previous courses, notably \textbf{CSCI 420: Principles of Data Mining}, to more advanced algorithms
+used at the intersection of data and computing after the preprocessing stage.
+
+After beginning this study, the intended deliverable outline was determined to be technically implausible and has been replaced with
+demonstrations of applied algorithms. Taking inspiration from the retinal mosaic as displayed in \textbf{CSCI 431: Intro to Computer Vision}
+and discussion in \textbf{IGME 589: Computational Creativity and Algorithmic Art} on the appearance and nature of randomness in graphics, I hope to create
+a program that can determine the likelihood that randomly distributed colors on a hexagonal grid appear as they do in an image.
 \newpage
 \section{Units}
@@ -155,4 +174,139 @@ To calculate standard error, kys.
 Statistical Inference is any data analysis to draw conclusions from a sample to make assertions about the population. Methods include estimation via averages and confidence intervals, and hypothesis testing, which attempts to invalidate (never \textit{validate}) a hypothesis.
+\newpage
+\subsection{Unit 2: Probabilistic Theories and Epistemology}
+When developing probabilistic models it is vital to use domain expertise to expose the product to the full range of external variables that would be expected
+of a model applied to the real world. Without an appropriate understanding of both the limitations in research procedures and the true value of the data collected,
+the integrity of the model becomes inherently compromised.
+
+As data scientists, we are uniquely at risk of falling into this trap because it is hard to fully grasp domain expertise when the nature of data science
+in a business setting frequently means consulting for many separate projects with a collectively massive scope. Equally, it is easy
+to assume that the sophistication of our tools overrides imperfections in the data, in spite of mantras like `Garbage In, Garbage Out'.
+
+In this unit I explored some common fallacies and assumptions held by analysts who may not fully grasp the content that they work with,
+nor the problems they intend to solve. This required extensive research that I found was best digested in the form of books whose chapters chronicle multiple
+examples of a given principle. As such, the reading was not confined to just the timeslot designated for this unit. Research started during the months leading up
+to the start of the semester\footnote{Only research during the semester was logged in the timesheet.} and has continued through the independent study. This structure was particularly helpful in pulling me back to regain perspective on
+my goal when I was knee-deep in feature construction and model formulation.
+
+\subsubsection{Moral Hazards and The Bob Rubin Trade}
+Picking up pennies in front of a steamroller.
+When studying the effectiveness of a model, the scope of review must capture the entire range of the sample space. Discarding black swans that do not impact
+the client does not mean the oversight will not reflect on the client. There is therefore a question of whether data scientists are obligated to include
+flags for significant real-world events that do not affect the course of action proposed to the client.
+
+The 2007--2009 recession, attributed to the collapse of the housing market bubble, is the most common example of a moral hazard because the displacement of risk to the taxpayer from
+banks that were federally required to give subprime loans meant that banks could profit from subprime loans but would not be harmed when the inevitable
+occurred. In popular media, the bursting of the housing bubble is attributed to the banks, while those in the industry passed off the event as something that nobody could
+have foreseen.\footnote{For instance, in the 2015 movie \textit{The Big Short}, only a few savvy traders who bothered to look into the details find that banks had,
+in their ignorance, built the bundled mortgages on an unstable foundation.} In reality, banks only ignored a probabilistic eventuality because their models did not
+need to account for such an event.
+
+Most accounts emphasize the problems with risk transference when creating models. For this study's purposes, the important lesson is that probabilistic models should not
+drop evaluations as soon as an event leaves the scope of the immediate client.
+
+\subsubsection{Ignoring Improbable Outliers with Outsized Impact}
+In machine learning it is common for algorithms to drop the most extreme (or a random selection of) datapoints to avoid overfitting and errors in data collection.
+One issue with the current implementation of this procedure is that it is often done blindly, ignoring the information that these outliers may relay. For instance,
+in a selection of 300 water samples from a stream, all but a few show a normal amount of oxygen in the stream. A citizen scientist may discount the remaining pockets
+as a statistical implausibility that is most likely indicative of a failure in sample testing and drop the most extreme 5\% of datapoints.
+However, if these few pockets show a complete disruption of the dissolution process, the vast majority of aquatic life in the stream will eventually pass through
+these pockets without oxygen and die, resulting in an outsized impact from just a few sources.
+
+Nassim Taleb in \textit{Fooled By Randomness} describes this event with an analogy to Russian Roulette: if there were a 5/6 chance of winning a million dollars and a
+1/6 chance of killing yourself, many people would at least hesitate before pulling the trigger. But what if the barrel held 10,000 chambers and there was only a
+1/10,000 chance of harm? In this case, many less-than-rational actors play the game repeatedly to acquire wealth indefinitely, forgetting, or never realizing,
+that eventually the unlikely, or, as the actor would see it, the unthinkable, happens and all of the gains are completely negated.
+
+\subsubsection{Fooled By Randomness}
+May justify its own subsection since the others acknowledge small probabilities whereas this is outright randomness.
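+
+The roulette analogy from the previous subsection can be made concrete with a short calculation. As a sketch only, assume
+each play is independent with a fixed probability \(p = 1/10{,}000\) of harm (the symbols \(p\) and \(n\) are introduced here
+for illustration and do not come from the source text). The probability of surviving \(n\) plays is then
+
+\[
+P(\text{survive } n \text{ plays}) = (1 - p)^{n} \approx e^{-np}
+\]
+
+so that after \(n = 10{,}000\) plays
+
+\[
+(1 - 0.0001)^{10000} \approx e^{-1} \approx 0.37
+\]
+
+Under these assumptions the chance of survival falls below one half after about 6{,}931 plays, since \((1 - p)^{n} = 0.5\)
+at \(n = \ln 2 / (-\ln(1 - p)) \approx 6931\). A small per-play risk therefore becomes the likely outcome once the game is repeated often enough.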
+
+\subsubsection{Lindy Effect}
+``For the perishable, every additional day in its life translates into a shorter additional life expectancy.
+For the nonperishable, every additional day may imply a longer life expectancy.''
+A proven tool is more likely to stand the test of time than the unproven tool replacing it.
+``The robustness of an item is proportional to its life!''
+
+``Inaccurate science\ldots is constantly being published. The Lindy-conscious consumer of scientific data will take seriously only
+information that has held up over a period of time.''\footnote{\url{https://www.nytimes.com/2021/06/17/style/lindy.html}}
+
+\subsubsection{Decision Theory}
+Decision theory is the study of how people make decisions with uncertain information. There are two main branches of decision theory:
+\subsubsection*{Normative/Rational Decision Theory}
+This branch studies how people \textit{should} make decisions. In problems with other actors, as in game theory, it is assumed that all other actors will also
+act with perfect rationality, allowing for precise calculation of the actions of all of the others and their expected utility to the agent.
+\subsubsection*{Descriptive Decision Theory}
+This branch studies how people actually make decisions, which includes factors such as psychological and emotional biases.
+
+\subsubsection{Info Gap Decisions}
+In info gap decision theory there is not enough information to assign probabilities to events and the goal is to select a course of action that is robust in the
+face of uncertainty. Where decision theory can account for expected irrationality to determine expected values, info gap decisions approximate the range of
+probabilities and weight them to estimate expected value. In essence, it applies probabilities to probabilities, adding an additional layer to insulate calculations
+from a lack of data or lack of understanding of a topic.
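+
+One way to read the idea of applying probabilities to probabilities is as a weighted average over candidate probabilities. The following is a
+sketch of that reading only, not the formal robustness function of info gap decision theory; the symbols \(p_k\), \(w_k\), \(v_1\),
+and \(v_0\) and all numbers are hypothetical. Suppose an event pays \(v_1\) if it occurs and \(v_0\) otherwise, and its unknown
+probability is believed to be one of \(p_1, \ldots, p_K\) with weights \(w_k\) that sum to 1:
+
+\[
+E[V] = \sum_{k=1}^{K} w_k \left( p_k v_1 + (1 - p_k) v_0 \right)
+\]
+
+For example, with \(v_1 = 100\), \(v_0 = 0\), candidates \(p = (0.1, 0.3, 0.5)\), and weights \(w = (0.2, 0.5, 0.3)\):
+
+\[
+E[V] = 0.2(10) + 0.5(30) + 0.3(50) = 32
+\]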
+
+\subsubsection{Methodology Considerations}
+Given that I have taken 10134023 instances of the last 40 years, in all of which Obama has been alive, I can say with a high degree of certainty that Obama is immortal.
+
+An event never having occurred in history does not discount the possibility of it occurring in the future. Similarly, events that may have been impossible in the past
+are not necessarily impossible in the future.
+Psychology also matters: someone who knows they are being studied will act differently than someone who is not being studied, so models built on such observations will be inaccurate.
+
+\newpage
+\subsection{Unit 3: Bayesian Statistics}
+This unit was deliberately separated from statistical review due to the perceived complexity of the topic and the magnitude of usage in recent data science
+breakthroughs. Bayes' Theorem is a part of the curriculum for both \textbf{MATH 351 - Probability and Statistics} and \textbf{CSCI 420 - Principles of Data Mining}.
+However, both approached the topic from different perspectives and neither solidified my personal confidence in its use, so I chose to take extra time to learn
+this important topic in my own way.
+
+It has been said that statistics does not come naturally to the human brain, hence statistics is, by mathematical standards, a
+young discipline. My research on Bayesian statistics has led me to the conclusion that the opposite may be true: Bayes' Theorem is quite intuitive, but
+the discipline has not had the time to crystallize best practices for teaching it. For instance, weighing probabilities against the
+number of documented occurrences is frequently used in philosophical discussion, in the form of explanations that a subset with a high likelihood of fulfilling a condition
+is a valid classification even when the subset is so small that cases fulfilling the condition are infrequently members of that subset. Most people understand
+these expressions but, when shown a table and how to calculate those ratios, the content enters the realm of collegiate instruction.
+
+\subsubsection{Bayes' Theorem}
+
+The equation for Bayes' Theorem is as follows:
+
+\[
+P(A|E) = \frac{P(A) \cdot P(E|A)}{P(A) \cdot P(E|A) + (1 - P(A)) \cdot P(E|\neg A)}
+\]
+
+This formula appears more complex than it is. The denominator, while directly translating to ``the probability of A times the probability of event E occurring in A,
+plus the probability of not A times the probability of E occurring in not A,''
+can be more easily expressed simply as \(P(E)\), the probability of event E occurring.
+
+By utilizing vernacular more familiar to everyday life, Bayes' Theorem can be translated into:
+
+\[
+\text{P(occurrence came from category)} = \frac{\text{\# of occurrences from category}}{\text{total \# of occurrences}}
+\]
+
+Finally, this equation is updated to replace descriptions with technical terms:
+
+\[
+\text{Posterior Probability} = \frac{\text{Prior} \cdot \text{Likelihood}}{\text{Evidence}}
+\]
+
+Even this equation can be misconstrued as a number of arrangements of ratios involving total occurrences from a category or non-occurrences from outside
+of the category, so as a final demonstration, the sample space will be visualized geometrically
+\footnote{Concept credit to 3Blue1Brown on YouTube; this video is what finally clarified in my mind what the equation behind Bayes' Theorem meant.\\
+\url{https://www.youtube.com/watch?v=HZGCoVF3YvM}} as a 1 unit by 1 unit square.
+
+
+\subsubsection{Bayesian Updating}
+Bayesian Updating is another term that has been added to buzzword vocabulary to describe a process that is not exclusive to Bayesian Statistics but appears
+to have been rediscovered by academia through the study of applied Bayes' Theorem. In essence, Bayesian Updating simply states that observed occurrences should not
+override previous evidence and should instead be added to it in equal weight (equal weight being a naive assumption). This evidence updating makes
+applications of Bayes' Theorem calculate posterior probabilities continuously as new information enters the system, with each posterior serving as the next prior, rather than in a calculation that is only done once.
+
+
+\subsubsection{Bayesian Belief Networks}
+Bayesian Belief Networks are probabilistic graphical models that preserve conditional dependence between random variables. In spite of their name,
+Bayesian Belief Networks do not necessarily apply Bayesian models, though they are a way to utilize Bayes' Theorem for domains with greater complexity beyond a
+single posterior probability. In this type of network, edges are directed and the structure is utilized in a single direction. This is in contrast to undirected
+models such as Markov random fields, which do not assume the order of acquisition of random variables.
+
 \end{document}
\ No newline at end of file
diff --git a/timesheet/reportUpdating.py b/timesheet/reportUpdating.py
index a6100fe..ace2c7b 100644
--- a/timesheet/reportUpdating.py
+++ b/timesheet/reportUpdating.py
@@ -36,7 +36,7 @@ def csv2Table(inFile):
     rows = list(reader)
     out = "\\begin{table}[h!]\n\\centering\n"
-    out += "\\begin{tabular}[t]{|" + " c | c | c | c | p{6cm} |}\n"
+    out += "\\begin{tabular}[t]{|" + " c | p{1.3cm} | c | c | p{6cm} |}\n"
     out += "\\hline\n"
     for row in rows:
diff --git a/timesheet/timesheet.csv b/timesheet/timesheet.csv
index a2c8926..d709162 100644
--- a/timesheet/timesheet.csv
+++ b/timesheet/timesheet.csv
@@ -1,10 +1,25 @@
 Week,Date,Type,Duration (Hours),Description
 1,08/30,Advising Meetings,2,"Stat Review Content acknowledgement, Latex overview for reports"
 2,09/02,Reporting,3,"First applications of Latex for final report, created Timesheet System."
-2,09/02,Research,2,"Stat Review: Sample Space through Probability Density Functions"
+2,09/02,Research,2.5,"Stat Review: Sample Space through Probability Density Functions"
 2,09/06,Advising Meetings,1,"Research Review and exploration of PDF expected values and confidence intervals"
+3,09/14,Research,3,"Reading: Fooled by Randomness by Nassim N. Taleb"
 4,09/19,Research,2,"Producing Confidence Intervals"
-4,09/20,Research,1,"Statistical Inference and t-testing"
+4,09/20,Research,1.5,"Statistical Inference and t-testing"
 4,09/20,Advising Meetings,1,"Stat Review finalization, definition of reporting standard"
-5,09/23,Research,2,"Parametric and Non-parametric tests"
-6,10/03,Reporting,4,"Structuring stat review report"
\ No newline at end of file
+5,09/23,Research,2.5,"Parametric and Non-parametric tests"
+5,09/26,Research,3,"Kinsman's suggested reading: Prob and Stat by Charles Linn"
+6,09/25 - 09/30,Research,5,"Reading: Fooled by Randomness by Nassim N. Taleb"
+6,10/03,Reporting,4,"Structuring stat review writeup"
+6,10/04,Reporting,2,"Confidence Statistics writeup"
+6,10/04,Research,2.5,"Ludic Fallacy Reading: Skin in the Game by Nassim N. Taleb"
+6,10/04,Advising Meetings,1,"Report review and discussion on replacing deliverables"
+6,10/05,Application,1.5,"Hexagonal basis vectors"
+7,10/08,Research,2,"The Black Swan by Nassim Taleb"
+7,10/10,Reporting,2,"Epistemology Writeup"
+7,10/10,Research,1.5,"The Lindy Effect: The Lindy Way of Living - NYT"
+7,10/11,Reporting,3,"Moral Hazards, Outsized Impact, Lindy Effect in writeup"
+7,10/11,Advising Meetings,1,"Epistemology and Overview discussion, hex mapping"
+8,10/15,Research,3,"Bayes Belief Networks"
+8,10/16,Application,2.5,"Bayes visualizations and practice worksheets"
+8,10/16,Reporting,2,"Early Bayesian Statistics Report"
\ No newline at end of file
diff --git a/timesheet/timesheet.pdf b/timesheet/timesheet.pdf
index e520d54..168777c 100644
Binary files a/timesheet/timesheet.pdf and b/timesheet/timesheet.pdf differ
diff --git a/timesheet/timesheet.tex b/timesheet/timesheet.tex
index d13cdf9..f40f11a 100644
--- a/timesheet/timesheet.tex
+++ b/timesheet/timesheet.tex
@@ -28,7 +28,7 @@
 % OPEN Timesheet
 \begin{table}[h!]
 \centering
-\begin{tabular}[t]{| c | c | c | c | p{6cm} |}
+\begin{tabular}[t]{| c | p{1.3cm} | c | c | p{6cm} |}
 \hline
 Week & Date & Type & Duration (Hours) & Description \\
 \hline
@@ -36,26 +36,57 @@ Week & Date & Type & Duration (Hours) & Description \\
 \hline
 2 & 09/02 & Reporting & 3 & First applications of Latex for final report, created Timesheet System. \\
 \hline
-2 & 09/02 & Research & 2 & Stat Review: Sample Space through Probability Density Functions \\
+2 & 09/02 & Research & 2.5 & Stat Review: Sample Space through Probability Density Functions \\
 \hline
 2 & 09/06 & Advising Meetings & 1 & Research Review and exploration of PDF expected values and confidence intervals \\
 \hline
+3 & 09/14 & Research & 3 & Reading: Fooled by Randomness by Nassim N. Taleb \\
+\hline
 4 & 09/19 & Research & 2 & Producing Confidence Intervals \\
 \hline
-4 & 09/20 & Research & 1 & Statistical Inference and t-testing \\
+4 & 09/20 & Research & 1.5 & Statistical Inference and t-testing \\
 \hline
 4 & 09/20 & Advising Meetings & 1 & Stat Review finalization, definition of reporting standard \\
 \hline
-5 & 09/23 & Research & 2 & Parametric and Non-parametric tests \\
+5 & 09/23 & Research & 2.5 & Parametric and Non-parametric tests \\
 \hline
-6 & 10/03 & Reporting & 4 & Structuring stat review report \\
+5 & 09/26 & Research & 3 & Kinsman's suggested reading: Prob and Stat by Charles Linn \\
+\hline
+6 & 09/25 - 09/30 & Research & 5 & Reading: Fooled by Randomness by Nassim N. Taleb \\
+\hline
+6 & 10/03 & Reporting & 4 & Structuring stat review writeup \\
+\hline
+6 & 10/04 & Reporting & 2 & Confidence Statistics writeup \\
+\hline
+6 & 10/04 & Research & 2.5 & Ludic Fallacy Reading: Skin in the Game by Nassim N. Taleb \\
+\hline
+6 & 10/04 & Advising Meetings & 1 & Report review and discussion on replacing deliverables \\
+\hline
+6 & 10/05 & Application & 1.5 & Hexagonal basis vectors \\
+\hline
+7 & 10/08 & Research & 2 & The Black Swan by Nassim Taleb \\
+\hline
+7 & 10/10 & Reporting & 2 & Epistemology Writeup \\
+\hline
+7 & 10/10 & Research & 1.5 & The Lindy Effect: The Lindy Way of Living - NYT \\
+\hline
+7 & 10/11 & Reporting & 3 & Moral Hazards, Outsized Impact, Lindy Effect in writeup \\
+\hline
+7 & 10/11 & Advising Meetings & 1 & Epistemology and Overview discussion, hex mapping \\
+\hline
+8 & 10/15 & Research & 3 & Bayes Belief Networks \\
+\hline
+8 & 10/16 & Application & 2.5 & Bayes visualizations and practice worksheets \\
+\hline
+8 & 10/16 & Reporting & 2 & Early Bayesian Statistics Report \\
 \hline
 \end{tabular}
 \end{table}
-\noindent Hours for Advising Meetings: 4\\
-Hours for Reporting: 7\\
-Hours for Research: 7\\
-\textbf{Total Hours: 18}\\
+\noindent Hours for Advising Meetings: 6.0\\
+Hours for Application: 4.0\\
+Hours for Reporting: 16.0\\
+Hours for Research: 28.5\\
+\textbf{Total Hours: 54.5}\\
 % CLOSE Timesheet
 \end{document}
\ No newline at end of file