the odds that such a friend may also be willing to throw limbs at his car so as to maintain their ever-reliable facade. If one also considers the possibility
that Shafer's friends mistakenly believed a limb fell on his car, this uncertainty must also be combined with the evidence for the most accurate picture.

\subsubsection{Minority Rule through Renormalization}
One way that details about a sample can be suppressed is through minority rule, where analysis is skewed by the influence of a small subsection of the population
imposing attributes onto a pliable, but larger, subsection of the population. Often used in social sciences and asymmetric warfare, the stubbornness of a handful of
people, say, those with a demanding preference for organic foods, requires the surrounding environment to adapt. Most people do not eat organic but would not
object if it were all that was offered. Thus, a family with a single person with a dietary preference can flip the entire kitchen to fit that preference. This
process is called renormalization, and it runs counter to the observations of outsiders, who might infer that the whole family prefers organic foods.

Scaled upwards, the renormalization effect might then apply itself to a cookout between families who acknowledge one family has a dietary preference. That might
then renormalize the entire community, resulting in local grocery store offerings being near-exclusive to the dietary preference of a remarkably small portion of the
community. If a data scientist then infers the dietary preferences of the community from the offerings of this grocery store, they would be inclined to believe that
what is actually a minority preference is not just a majority preference, but a requirement amongst the population. In this sense, tolerance for intolerance begets intolerance.
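The cascade described above can be caricatured in a few lines of Python; every number here is invented purely for illustration:

```python
# A toy caricature of minority rule (all numbers invented): one stubborn
# organic-only eater flips their family's kitchen, the shared grocery store
# then caters to that family, and an observer reading the shelves infers a
# community-wide requirement.

population = 40          # 10 families of 4 people
family_size = 4
stubborn = 1             # people who actually insist on organic

# Share of people who truly hold the preference
actual_share = stubborn / population                        # 0.025

# After the kitchen flips, the whole family appears to hold it
family_share = (stubborn > 0) * family_size / population    # 0.1

# After the store renormalizes, the preference looks universal
store_share = 1.0 if stubborn > 0 else 0.0                  # 1.0
```

Each renormalization step inflates the apparent share, which is exactly what misleads the observer at the end.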
\subsubsection{Methodology Considerations}
I have taken 10134023 instances of the last 40 years, during all of which Obama has been alive. Therefore I can say with a high degree of certainty that Obama is
immortal.
\subsubsection{Markov Chains}
Markov Chains are a form of probabilistic automaton where the likelihood of transitioning to a new state depends solely on the current state, with no memory of prior
states. For example\footnote{example sourced from:\\\url{https://towardsdatascience.com/introduction-to-markov-chains-50da3645a50d}}, suppose a weather prediction
program wants to know whether tomorrow will be a sunny or cloudy day, based on the current weather. Using the current weather as a state, the program identifies that
there is a 10\% chance of a sunny day transitioning into a cloudy day and a 50\% chance that a cloudy day transitions into a sunny day:
\end{center}

Note that there is no information preserved between steps. Markov Chains are memoryless, so any information that must be available to them must be expressed as the
state, such as the sunny and cloudy states in the example above. Academically, this is called the \textbf{Markov Assumption}, though it is vocabulary that can easily
be explained with few additional words and won't be used for the rest of this paper. One benefit of such a straightforward structure is that it enables easy
calculation of the probabilities of reaching a state k steps from the current position. By expressing the chain as a transition matrix where the rows represent the
current state, the columns represent the next state, and each cell contains the probability of moving from the row state to the column state, we get a
1-step transition matrix:
\[
\begin{pmatrix}
\end{pmatrix}
\]

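The k-step calculation above is just repeated matrix multiplication. A minimal sketch using the sunny/cloudy numbers (numpy is my choice here, not a dependency of this paper):

```python
import numpy as np

# 1-step transition matrix for the sunny/cloudy example, ordered [Sunny, Cloudy]:
# rows are the current state, columns are the next state.
P = np.array([[0.9, 0.1],    # Sunny -> Sunny, Sunny -> Cloudy
              [0.5, 0.5]])   # Cloudy -> Sunny, Cloudy -> Cloudy

# The k-step transition probabilities are the k-th power of the matrix.
P2 = np.linalg.matrix_power(P, 2)
# P2[0, 1]: probability a sunny day is cloudy two days later (0.14)
```

Each row of \(P^k\) still sums to one, since every row remains a probability distribution over next states.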
\subsubsection{Hidden Markov Models}\label{HMMs}
In contrast to the visible Markov Models above, Hidden Markov Models cannot observe the states within the model. The benefit to using such a model is that
observations of occurrences can be run through algorithms such as the Viterbi Algorithm to determine the probability of a sequence of observations and estimate which state is
active in a given instance. This extrapolation from results back to the process that produced them is reminiscent of inverse problems and many explanatory uses of data science, such as
in finance where, with the benefit of hindsight, analysts work to determine why events unfolded the way they did.

In addition to states, initial state probabilities, and transition probabilities, Hidden Markov Models also utilize observations and emission probabilities, or the
probability of an observation given a transition from state a to b. Using the earlier example where states represent either a sunny or cloudy day, an observation
likelihood matrix can be created for a weather sensor that can only determine if the ground is wet. On a cloudy day there is a probability of rain and thus a high
probability of the ground being wet, whereas on a sunny day the sensor would be triggered far less often, perhaps by dew or sensor tampering:
\[
\begin{array}{c c}
& \begin{array}{ccc} % Align column labels above the matrix
\text{dry} & \text{wet}
\end{array} \\ % End the first row (labels) with double backslash
\begin{array}{c} % Row labels
\text{Sunny} \\
\text{Cloudy} \\
\end{array} &
\begin{bmatrix} % Matrix with brackets
.95 & .05 \\
.6 & .4 \\
\end{bmatrix}
\end{array}
\]

Thus, an observation sequence may look like this:
\[
[\text{Dry, Dry, Wet}]
\]

In this case, it can be confidently assumed that the wet signal is representative of a rainy, cloudy day. In contrast, we can only be moderately confident that the
two dry days leading up to it were sunny days. Intuitively, it is most likely that there were two sunny days followed by a rainy day. By multiplying the emission
probability of each observation by the probability of the transition into the corresponding state, the probability of each sequence is revealed. Below, we assume a 50-50 chance of initialization at a sunny
or cloudy day:
\begin{center}
Three consecutive sunny days:
\[(.5 * .95) * (.9 * .95) * (.9 * .05) \approx 0.01828 \]
Three consecutive cloudy days:
\[(.5 * .6) * (.5 * .6) * (.5 * .4) = 0.018 \]
Sunny, sunny, cloudy:
\[(.5 * .95) * (.9 * .95) * (.1 * .4) \approx 0.01625 \]
\end{center}

Interestingly, the calculation reveals that it is actually more probable that there was an unusual wet third day during a sunny streak than for there to have been
a cloudy day following two sunny days.\footnote{I say interesting because I forgot how low I set the probability of sunny to cloudy and wholly expected the intuitive
sun-sun-cloud answer to prove accurate. Math moment.}
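The three probabilities above can be reproduced by brute-force scoring of every hidden sequence. A minimal sketch, with parameter names of my own choosing (note that the exhaustive \texttt{product} enumeration is exactly the exponential blowup the Viterbi Algorithm below avoids):

```python
from itertools import product

# Model parameters from the sunny/cloudy example
states = ["Sunny", "Cloudy"]
initial = {"Sunny": 0.5, "Cloudy": 0.5}
transition = {"Sunny": {"Sunny": 0.9, "Cloudy": 0.1},
              "Cloudy": {"Sunny": 0.5, "Cloudy": 0.5}}
emission = {"Sunny": {"Dry": 0.95, "Wet": 0.05},
            "Cloudy": {"Dry": 0.6, "Wet": 0.4}}

observations = ["Dry", "Dry", "Wet"]

def sequence_probability(state_seq, obs):
    """Joint probability of a hidden state sequence and the observations."""
    p = initial[state_seq[0]] * emission[state_seq[0]][obs[0]]
    for prev, cur, o in zip(state_seq, state_seq[1:], obs[1:]):
        p *= transition[prev][cur] * emission[cur][o]
    return p

# Brute force: score every possible hidden sequence (exponential in length)
scores = {seq: sequence_probability(seq, observations)
          for seq in product(states, repeat=len(observations))}
best = max(scores, key=scores.get)   # ("Sunny", "Sunny", "Sunny")
```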

Brief sidenote: since the initial state is not known, the probability of initialization at state \(n\) is expressed in calculations as \(\pi_n\). I will
not use this notation in this report because I think it is confusing and somewhat ridiculous to have mathematical notation with as ubiquitous and universally constant
a meaning as \(\pi\) be repurposed for something that has no relation to the constant. Whatever convention made this determination is seriously damaging the
accessibility of mathematics for anybody shy of a walking computational index.

\subsubsection{Viterbi Algorithm}
While it is feasible to calculate the probabilities for each possible route to a series of observations, such a process produces an exponential time complexity.
With each state change, the number of paths to keep track of grows exponentially, which in practical terms means countless threads on each state separated only by
the history of how they got there. Enter the Viterbi Algorithm, which reduces the effect of a step (or, as in our example, a new day) from an exponential
relationship ( \(O(N^T)\) ) to a flat multiple ( \(O(N^2 T)\) ). This is possible because the Viterbi Algorithm creates partial solutions by eliminating all but the
most probable branch to reach each state instead of recomputing each exit from a state for each entry. If a route is deemed improbable, it will not be considered
the next time the same observation sequence occurs at that state.

More intuitively, consider that there are multiple ways to reach a given state in 1 step. Once each path's probability is computed, you only need to retain the
highest probability path to that state, and the next step will only require calculation from that state once.\footnote{The mathematical notation to describe this
algorithm is criminally challenging to parse. I want to acknowledge this video for being the only one of its kind that did not rely on the notation:
\url{https://www.youtube.com/watch?v=6JVqutwtzmo}} Consider the following graphic rendition of each possible 3-day sequence of sunny vs cloudy:

\begin{center}
\begin{tikzpicture}[shorten >=1pt, node distance=3cm, on grid, auto]

\node[state] (Sunny1) {Sunny};
\node[state, below=of Sunny1] (Cloudy1) {Cloudy};
\node[state, right=of Sunny1] (Sunny2) {Sunny};
\node[state, below=of Sunny2] (Cloudy2) {Cloudy};
\node[state, right=of Sunny2] (Sunny3) {Sunny};
\node[state, below=of Sunny3] (Cloudy3) {Cloudy};
\node[above=of Sunny1, yshift=-1.5cm]{Day 1};
\node[above=of Sunny2, yshift=-1.5cm]{Day 2};
\node[above=of Sunny3, yshift=-1.5cm]{Day 3};

\path[->]
(Sunny1) edge node {} (Sunny2)
edge node {} (Cloudy2)
(Cloudy1) edge node {} (Sunny2)
edge node {} (Cloudy2)
([yshift=1mm] Sunny2.east) edge[->] node {} ([yshift=1mm] Sunny3.west)
([yshift=-1mm] Sunny2.east) edge[->] node {} ([yshift=-1mm] Sunny3.west)
([yshift=1mm] Sunny2.east) edge[->] node {} ([yshift=1mm] Cloudy3.west)
([yshift=-1mm] Sunny2.east) edge[->] node {} ([yshift=-1mm] Cloudy3.west)
([yshift=1mm] Cloudy2.east) edge[->] node {} ([yshift=1mm] Sunny3.west)
([yshift=-1mm] Cloudy2.east) edge[->] node {} ([yshift=-1mm] Sunny3.west)
([yshift=1mm] Cloudy2.east) edge[->] node {} ([yshift=1mm] Cloudy3.west)
([yshift=-1mm] Cloudy2.east) edge[->] node {} ([yshift=-1mm] Cloudy3.west);

\end{tikzpicture}
\end{center}

Notice that there are two arrows from each day 2 state to each day 3 state because two paths were created to reach each of the day 2 states. If there was a
fourth day depicted, there would be 4 calculations from each day 3 state to each day 4 state. To prevent this, the Viterbi Algorithm only preserves the most likely
path to each node. For instance, there are two paths to a sunny day on day 2. Either the first day was sunny and it stayed sunny, or the first day was cloudy but
transitioned to sunny the next day. Using the same \([\text{Dry, Dry, Wet}]\) observation sequence as before, the probabilities of these paths occurring can be
calculated:

\begin{center}
Two consecutive sunny days:
\[(.5 * .95) * (.9 * .95) = 0.406125 \]
Cloudy, sunny:
\[(.5 * .6) * (.5 * .95) = 0.1425 \]
\end{center}

Hence, we can eliminate the \([\text{Cloudy, Sunny}]\) starting sequence from the most probable sequence of steps given the observations. Doing the same thing
for the rest of the visualization leaves fewer arrows and therefore fewer calculations:

\begin{center}
\begin{tikzpicture}[shorten >=1pt, node distance=3cm, on grid, auto]

\node[state] (Sunny1) {Sunny};
\node[state, below=of Sunny1] (Cloudy1) {Cloudy};
\node[state, right=of Sunny1] (Sunny2) {Sunny};
\node[state, below=of Sunny2] (Cloudy2) {Cloudy};
\node[state, right=of Sunny2] (Sunny3) {Sunny};
\node[state, below=of Sunny3] (Cloudy3) {Cloudy};
\node[above=of Sunny1, yshift=-1.5cm]{Day 1};
\node[above=of Sunny2, yshift=-1.5cm]{Day 2};
\node[above=of Sunny3, yshift=-1.5cm]{Day 3};

\path[->]
(Sunny1) edge node {} (Sunny2)
(Cloudy1) edge node {} (Cloudy2)
(Sunny2) edge node {} (Sunny3)
(Cloudy2) edge node {} (Cloudy3);
\path[->, draw=red]
(Sunny1) edge node[midway] {\textbf{x}} (Cloudy2)
(Cloudy1) edge node[midway] {\textbf{x}} (Sunny2)
(Sunny2) edge node[midway] {\textbf{x}} (Cloudy3)
(Cloudy2) edge node[midway] {\textbf{x}} (Sunny3);
\end{tikzpicture}
\end{center}

With only two sequences remaining, the final comparison needs only to determine if it is more likely for there to have been three consecutive sunny days or three
consecutive cloudy days, which was already done in the Hidden Markov Model section (\ref{HMMs}).

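The pruning described above can be sketched in a few lines of Python, assuming the same sunny/cloudy parameters; all names are my own:

```python
# A minimal Viterbi sketch for the sunny/cloudy example.
states = ["Sunny", "Cloudy"]
initial = {"Sunny": 0.5, "Cloudy": 0.5}
transition = {"Sunny": {"Sunny": 0.9, "Cloudy": 0.1},
              "Cloudy": {"Sunny": 0.5, "Cloudy": 0.5}}
emission = {"Sunny": {"Dry": 0.95, "Wet": 0.05},
            "Cloudy": {"Dry": 0.6, "Wet": 0.4}}

def viterbi(observations):
    """Most probable hidden state sequence, keeping only the best path to each state."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (initial[s] * emission[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for cur in states:
            # Keep only the most probable way to arrive at `cur`;
            # every other branch into `cur` is pruned (the red x's above).
            new_best[cur] = max(
                (best[prev][0] * transition[prev][cur] * emission[cur][obs],
                 best[prev][1] + [cur])
                for prev in states)
        best = new_best
    return max(best.values())

prob, path = viterbi(["Dry", "Dry", "Wet"])
```

Because only one path per state survives each step, the work per day is a fixed \(N \times N\) sweep rather than a doubling of live paths.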
\newpage
\subsection{Unit 5: Monte Carlo Simulations}
Found this later, it's sort of similar. \url{https://www.bayesserver.com/}

Even better than their jank Bayesian belief network, I may be able to make mixed Bayesian/Markov chain models. This is a big project.

\subsection{Cost-Benefit Analysis of Remote Education}
This section covers a calculation I devised to make me feel better about my life decisions. The data is based on implicit guesswork and, while I will be taking it
somewhat seriously for my decision to do either the online or on-campus RIT Data Science Masters Program, it should not be taken seriously as a probabilistic model.
Since there is no framework for making a subjective decision weighing the potential benefits of on-campus life against the value of entering the workforce 18 months
sooner, I decided to make one. Inshallah I shall reach my true potential and fulfill destiny.

\subsubsection{Selecting and Creating Key Metrics}
Since both programs result in a Data Science M.S. degree (albeit under the school of Software Engineering for on-campus versus the school of Information for online),
the functional equivalence of the resulting certificate of completion is an effective isolator of potential long-term ramifications in career path that might otherwise
be dictated by hiring processes that favor one degree over the other. Therefore, this analysis is justified in focusing only on events occurring during my extended
education. I have selected two calculated features\footnote{features that I do not intend to calculate on the basis that it is impossible without a crystal ball and
knowledge of fortune telling - a cursed art that has been forbidden by the council for centuries.} that are important to determining the utility of
potential events from each masters program.

The generalized feature I've selected is serendipity\footnote{Read more about this definition of serendipity in \textit{Where Good Ideas Come From: The Natural
History of Innovation} by Steven Johnson}: the potential for the spontaneous formulation of creative genius brought about by the random collision of ideas - the
proverbial cafe of intellectuals where overheard conversations turn into incredible revelations. The on-campus program excels in this category because it extends
my stay in the academically diverse setting of Rochester Institute of Technology's main campus, potentially enabling interdisciplinary connections and research
opportunities. It also would grant me more time to get involved in the Simone Center for Innovation and Entrepreneurship, which is an enticing hub for startups that
I can see myself becoming a key part of. In contrast, the online program offers me few opportunities to connect within RIT while opening the door to starting a
career in person sooner, which holds potential for intrapreneurship and a more directed interdisciplinary relationship. I acknowledge the magnitude of such
opportunities to be lesser, but more probable, especially if I change jobs more frequently.

When I was first choosing features, I wanted to include a second metric to capture a level of character growth and mental health as a reflection of the impact of being
online and not being face-to-face with other people. In doing so, I'd be modeling real-life variables that most would overlook.
Digging into it, I realized I'd have to derive it from the magnitude and probabilities of social advantages of each program.
The community fostered, the friends not made. I can't bring myself to even make up numbers for that in a goof napkin-math formula.
Measuring covariance between these two features just feels disgusting. Instead, I'm going to negate the whole variable with this assumption about finding something
else to do with my life outside of work:
\begin{center}
\textit{The negative social effects of online program isolation are equal to and canceled out by the personal growth derived from the extra effort to find
'the third place'\footnote{First and second places are home and work. Read more at: \url{https://en.wikipedia.org/wiki/Third_place}} seeded by the frustration
towards myself for putting myself in this position.}
\end{center}

\paragraph{Creating PMFs}

Let's create probability mass functions for our feature in each program to subjectively measure potential:

Let the probability of magnitude \(X\) serendipity on the campus program and the online program be denoted \(P(X_c)\) and \(P(X_o)\) respectively.

The on-campus program has advantages in serendipity, but while events may be an order of magnitude more impactful, I've already been on campus for three and a half
years and it feels highly unlikely that I will make sufficient changes to my routines to grant me more than a marginal probability of a serendipitous event occurring:
\begin{equation*}
P(X_c) =
\begin{cases}
.8\qquad\text{if }&X=0\\
.105&X=1\\
.045&X=2\\
.025&X=3\\
.0125&X=4\\
.009&X=5\\
.0035&X=6\\
0&\text{Otherwise}
\end{cases}
\end{equation*}

**graph**

The online program wields greater chances of serendipity by placing me in more unique environments by means of starting my career sooner, hopefully giving me more
time to utilize what remains of my ambition before it crumbles with age and routine. There may be less of an impact for a serendipitous event when experiencing it
remotely or within a corporate structure, but what does a foolish little boy still in school know about the passion imbued by one's own accidental discoveries?

\begin{equation*}
P(X_o) =
\begin{cases}
.6\qquad\text{if }&X=0\\
.225&X=1\\
.115&X=2\\
.045&X=3\\
.0087&X=4\\
.0043&X=5\\
.002&X=6\\
0&\text{Otherwise}
\end{cases}
\end{equation*}

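As a quick sanity check in the same napkin-math spirit (my own addition, names invented), both PMFs sum to one, and their expected magnitudes can be compared:

```python
# Sanity check on the two PMFs above: magnitude -> probability
pmf_campus = {0: .8, 1: .105, 2: .045, 3: .025, 4: .0125, 5: .009, 6: .0035}
pmf_online = {0: .6, 1: .225, 2: .115, 3: .045, 4: .0087, 5: .0043, 6: .002}

def expected_magnitude(pmf):
    """E[X] = sum of x * P(x) over the support."""
    return sum(x * p for x, p in pmf.items())

# Both are valid PMFs: probabilities sum to 1
assert abs(sum(pmf_campus.values()) - 1) < 1e-9
assert abs(sum(pmf_online.values()) - 1) < 1e-9

e_campus = expected_magnitude(pmf_campus)   # 0.386
e_online = expected_magnitude(pmf_online)   # ~0.6583
```

By this crude summary the online program's expected serendipity magnitude edges out campus, though collapsing a PMF to its mean discards exactly the heavier-tail, higher-impact argument made above for the on-campus events.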

**graph**

With archaic knowledge imbued by Dr. Pepper flowing through my veins, I have selected \(y = 3x^2 - 2y\) as the equation for covariance.

\end{document}