class: center, middle, inverse, title-slide

# Sequence Analysis
## A (practical) introduction
### Nicola Barban
Alma Mater Studiorum Università di Bologna
Dipartimento di Scienze Statistiche
### 8 April 2021

---

# Outline

* **Introduction to Sequence Analysis in social sciences**
    - SA as a descriptive tool for life course trajectories
    - Measuring distances among trajectories (Optimal Matching)
    - Data reduction (clustering)
* **R code available [here](lectureSA.R)**
* **Example data available [here](data_EE.csv)**
* **Examples based on:**

> Sironi, Maria, Nicola Barban, Luca Maria Pesando, and Frank Furstenberg. 2020. "A Sequence-Analysis Approach to the Study of the Transition to Adulthood in Low- and Middle-Income Countries." University of Pennsylvania Population Center Working Paper (PSC/PARC), 2020-48. https://repository.upenn.edu/psc_publications/48.

---

# Life course paradigm

* The life course paradigm (Elder, 1985, 1994; Giele and Elder, 1998): individuals, as human agents, build their future on the basis of the **constraints and opportunities** experienced in the past
* The process is **iterative and cumulative**, since initial advantages or disadvantages are often amplified over time
* Life courses are embedded in different times and locations, and are affected by the **social context** in which individuals live
* **Life domains are strongly interdependent**

---

# Transitions

## **Transitions**

* A transition is a discrete life change or event within a trajectory (e.g., from single to married).
* Transitions are often accompanied by socially shared ceremonies and rituals, such as a graduation or a wedding ceremony.

<img src="img/laurea.jpg" alt="" class="center" width="350" height="350" >

---

# Trajectories

## **Trajectories**

* A trajectory is a sequence of linked states within a conceptually defined range of behavior or experience.
* A trajectory can also be envisioned as a sequence of transitions that are enacted over time.
* A trajectory is a long-term pathway, with age-graded patterns of development in major social institutions such as education or family.

> Taking the whole trajectory as an input in statistical analysis is not straightforward (George, 2009).

<img src="img/lifecourse.jpg" alt="" class="center" width="300" height="300" >

---

# Life course analysis

Life course analysis is the statistical analysis of data on:

1. the **timing** of events (when do events happen?),
2. their **quantum** (how many events happen?),
3. and their **sequencing** (in which order do events happen?)

---

# Sequence analysis

* In the 1990s Abbott introduced sequence analysis in the social sciences. The method has its origins in information science and computational biology (DNA) (Abbott, 1995)
* Life courses are represented in terms of sequences of states (time is intrinsically discrete)
* As a simple example, we shall consider three states: single (**S**), cohabiting (**C**), married (**M**), in a monthly time scale from age 20 to 24. The sequence representation of an individual life course may thus be:

**SSSSSSCCCCCCCSSSSSSSSSSSSSSSSSSSSSSCCCSSSSSSSSSSSMMMMMMMMM**

### Sequence analysis is a set of techniques for **describing and analyzing trajectories**

---

<img src="img/seqs.png" alt="" class="center" width="750" height="350" >

---

# How to import sequences in R

```r
library(TraMineR)
library(tidyverse)
library(knitr)
library(ggsci)
# Only data on Eastern Europe
df<-read_csv("data_EE.csv")
# One colour, one long label and one short code per state
colors=pal_npg("nrc", alpha = .9)(9)[c(1:4,7)]
labels=c("No Sex-No Union","Sex- No Union","Union-No Sex","No Children","Children")
scode=c("NoSex-NoUn","Sex-NoUn","Un-NoSex","NoChild", "Child")
# Columns 19 to 37 hold the yearly states from age 12 to 30
*seqdata=seqdef(df[,19:37], cnames=12:30, cpal=colors, missing=NA, states=scode, labels=labels, missing.color="white", weights=df$aw)
```

---

```r
# First five sequences, ages 12 to 21
example <- seqdata[1:5,1:10]
kable(example)
```

|12         |13         |14         |15         |16         |17         |18         |19         |20         |21         |
|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
|NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |Sex-NoUn   |Sex-NoUn   |
|NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |Un-NoSex   |NoChild    |NoChild    |Child      |Child      |
|NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |
|NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |
|NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoSex-NoUn |NoChild    |NoChild    |NoChild    |

---

# State Permanence Sequence (SPS) representation

```r
*seqformat(example, from = "STS", to = "SPS",compress = TRUE)
```

```
##   Sequence
## 1 "(NoSex-NoUn,8)-(Sex-NoUn,2)"
## 2 "(NoSex-NoUn,5)-(Un-NoSex,1)-(NoChild,2)-(Child,2)"
## 3 "(NoSex-NoUn,10)"
## 4 "(NoSex-NoUn,10)"
## 5 "(NoSex-NoUn,7)-(NoChild,3)"
```

---

# Distinct States Sequence (DSS)

```r
seqdss(example)
```

```
##   Sequence
## 1 NoSex-NoUn-Sex-NoUn
## 2 NoSex-NoUn-Un-NoSex-NoChild-Child
## 3 NoSex-NoUn
## 4 NoSex-NoUn
## 5 NoSex-NoUn-NoChild
```

---

# Describe sequences

```r
seqIplot(example)
```

<img src="lectureSA_files/figure-html/unnamed-chunk-5-1.png" width="60%" />

---

# Distribution Plot

```r
seqdplot(seqdata)
```

<img src="lectureSA_files/figure-html/unnamed-chunk-6-1.png" width="60%" />

---

# Mean time

```r
seqmtplot(seqdata)
```

<img src="lectureSA_files/figure-html/unnamed-chunk-7-1.png" width="60%" />

---

# Derive age at events

```r
# Time spent in each state, per sequence
duration<-seqistatd(seqdata)
```

<img src="lectureSA_files/figure-html/unnamed-chunk-8-1.png" width="60%" />

---

# Calculate time between events

<img src="lectureSA_files/figure-html/unnamed-chunk-9-1.png" width="60%" />

---
class: center, inverse
background-image: url("img/TimeAtEvents.pdf")

---

# Categorical time series

**Life course trajectories can be described as categorical time series**

* Each individual `\(i\)` can be associated with a variable `\(s_{it}\)` indicating her/his life course status at time `\(t\)`.
* Since `\(s_{it}\)` can be assumed to take a finite number of values, trajectories can be described as categorical time series.
* More formally, let us define a discrete-time stochastic process `\(\{S_t : t \in T\}\)` with state space `\(\Sigma = \{\sigma_1, \ldots, \sigma_K\}\)` and realizations `\(s_{it}\)`, `\(i = 1, \ldots, N\)`. The life course trajectory of individual `\(i\)` is described by the sequence `\(s_i = \{s_{i1}, \ldots, s_{iT}\}\)`.

---

# Distance between categorical time series

**How to compare life trajectories?**

> The **Levenshtein distance** between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being **insertion**, **deletion**, or **substitution** of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. (Wikipedia)

https://phiresky.github.io/levenshtein-demo/

---

# Optimal Matching

* Andrew Abbott introduced the idea of sequence analysis in social science
* Great intuition: setting **different substitution costs**. Not all transitions are equal!
* First applications used costs based on theory

<img src="img/abbottAJS1990.png" alt="Abbott, American Journal of Sociology, 1990" class="centered image" width="350" height="350">

---

# Computing distances

Distances are built from a set of three basic operations, `\(\Omega = \{\iota, \delta, \sigma\}\)`:

1. `\(\iota\)` denotes **insertion** (one state is inserted into the sequence)
2. `\(\delta\)` denotes **deletion** (one state is deleted from the sequence)
3. `\(\sigma\)` denotes **substitution** (one state is replaced by another state).

****

* To each of these elementary operations `\(\omega_k \in \Omega\)`, a specific cost `\(c(\omega_k)\)` can be assigned.
* If `\(K\)` basic operations must be performed to transform one sequence into another, the transformation cost can be computed as `\(c(\omega_1, \ldots, \omega_K) = \sum^K_{k=1} c(\omega_k).\)`
* The distance between two sequences is the minimum transformation cost over all possible sets of operations.

### The **Matrix of dissimilarities** `\((N \times N)\)` includes all the pairwise distances

---

# What choice of costs?

* **substitution costs** can be derived from theory (e.g. occupational prestige, military ranking)
* using **inverse transition rates** (i.e. rare transitions have "higher costs")

```r
# TRATE: cost(a, b) = 2 - p(a -> b) - p(b -> a), with p the observed transition rates
submat<-seqsubm(seqdata, method="TRATE", with.missing=TRUE)
kable(submat)
```

|             | NoSex-NoUn->| Sex-NoUn->| Un-NoSex->| NoChild->|  Child->| *->|
|:------------|------------:|----------:|----------:|---------:|--------:|---:|
|NoSex-NoUn-> |     0.000000|   1.961658|   1.985854|  1.940990| 1.991982|   2|
|Sex-NoUn->   |     1.961658|   0.000000|   2.000000|  1.746838| 1.932960|   2|
|Un-NoSex->   |     1.985854|   2.000000|   0.000000|  1.595756| 1.687003|   2|
|NoChild->    |     1.940990|   1.746838|   1.595756|  0.000000| 1.524420|   2|
|Child->      |     1.991982|   1.932960|   1.687003|  1.524420| 0.000000|   2|
|*->          |     2.000000|   2.000000|   2.000000|  2.000000| 2.000000|   0|

---

# Computing distances

```r
# Pairwise optimal matching distances between all sequences
*dist.om1=seqdist(seqdata, method="OM", indel=1, sm=submat, with.missing=TRUE)
kable(dist.om1[1:8,1:8], digits=2)
```

|      |      |      |      |      |      |      |      |
|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0.00|  9.73|  3.96|  7.52| 11.53|  5.46| 11.99| 21.80|
|  9.73|  0.00| 11.73| 12.00| 13.13|  6.00|  3.52| 27.78|
|  3.96| 11.73|  0.00|  3.56| 12.14|  7.44| 13.99| 17.85|
|  7.52| 12.00|  3.56|  0.00| 13.93|  6.00| 14.00| 15.93|
| 11.53| 13.13| 12.14| 13.93|  0.00| 12.68| 16.66| 23.50|
|  5.46|  6.00|  7.44|  6.00| 12.68|  0.00|  8.00| 21.91|
| 11.99|  3.52| 13.99| 14.00| 16.66|  8.00|  0.00| 29.82|
| 21.80| 27.78| 17.85| 15.93| 23.50| 21.91| 29.82|  0.00|

---

# How to use dissimilarities?

1. Describe groups
2. Match individuals with similar trajectories
3. Compute **typical trajectories** using data reduction techniques, i.e. cluster analysis

```r
library(cluster)
# Hierarchical clustering (Ward's method) on the OM dissimilarity matrix
seq.clusterward = agnes(dist.om1, diss = TRUE, method = "ward")
```

---

# Choose the number of clusters

```r
# Cut the tree into four clusters
cl4 = cutree(seq.clusterward, k = 4)
```

<img src="img/clusters_choice.png" alt="" class="center" width="300" height="300">

```r
library(WeightedCluster)
# Cluster quality statistics for solutions with 2 to 10 clusters,
# using the same dissimilarities and weights as above
wardRange <- as.clustrange(seq.clusterward, diss = dist.om1, ncluster = 10, weights=df$aw)
summary(wardRange, max.rank = 2)
plot(wardRange, stat = c("ASW", "HG", "PBC", "HC"))
```

---
class: center

<img src="img/clusters_ALL.pdf" alt="" class="center" width="650" height="550">

---
class: center

<img src="img/clustersolution_meantime.pdf" alt="" class="center" width="650" height="550">

---

# Early Rapid Transition

<img src="img/cluster1map.pdf" alt="" class="center" width="650" height="550">

---

# Rapid Transition

<img src="img/cluster2map.pdf" alt="" class="center" width="650" height="550">

---

# Gradual Transition

<img src="img/cluster3map.pdf" alt="" class="center" width="650" height="550">

---

# Delayed Rapid Transition

<img src="img/cluster4map.pdf" alt="" class="center" width="650" height="550">

---

# Describe prevalence over time

<img src="img/Fig5.eps" alt="" class="center" width="650" height="550">

---

# Additional analysis

* The number of possible trajectory configurations is very high!
* Cluster analysis is a way to reduce this complexity to a few categories
* It is very common to combine cluster analysis with additional analyses:

1. cluster as **outcome variable**
2. cluster as **independent variable**

---

# Example

Association between log GDP and "Early Transition"

<img src="img/gdp_cl1.pdf" alt="" class="center" width="650" height="550">

---

# Conclusions

* Sequence analysis focuses on **trajectories** (when, how much, and in what order)
* Trajectories are **categorical time series**
* It offers a series of techniques to **describe and analyze** life courses
* Optimal matching is the most common metric to describe **dissimilarities**
* Using cluster analysis, trajectories are summarised in **typical trajectories**
* Often the result of cluster analysis is used in further analysis, typically **multivariate regression**

---
class: center, invert

## Questions?

### n.barban@unibo.it
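
---

# Appendix: a toy optimal matching distance

The insertion, deletion and substitution operations from the *Computing distances* slide can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not TraMineR's implementation: the function name `om_distance()`, the three states **S**, **C**, **M** and the cost values in `sm` are all made up for the example.

```r
# Toy optimal matching distance via dynamic programming (illustrative sketch).
# om_distance(), sm and the cost values below are hypothetical, not from TraMineR.
om_distance <- function(x, y, sm, indel = 1) {
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1]: minimal cost of turning the first i states of x
  # into the first j states of y
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- (0:n) * indel  # delete the first i states of x
  d[1, ] <- (0:m) * indel  # insert the first j states of y
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + indel,       # deletion of x[i]
        d[i + 1, j] + indel,       # insertion of y[j]
        d[i, j] + sm[x[i], y[j]]   # substitution (cost 0 if states are equal)
      )
    }
  }
  d[n + 1, m + 1]
}

states <- c("S", "C", "M")
# Assumed substitution costs: S <-> M counts as a larger change than S <-> C or C <-> M
sm <- matrix(c(0, 1, 2,
               1, 0, 1,
               2, 1, 0),
             nrow = 3, byrow = TRUE, dimnames = list(states, states))

a <- c("S", "S", "C", "M")
b <- c("S", "C", "C", "M")
om_distance(a, b, sm)  # 1: one S -> C substitution is cheaper than delete + insert
```

With these assumed costs the two sequences differ by a single substitution. On real data, `seqdist()` computes the same kind of distance for every pair of sequences, which yields the matrix of dissimilarities.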