Causalness, frequency, diachrony and the causal-noncausal alternation in Italian and Spanish

Corpus Linguistics and Linguistic Theory - final submission December 2025

Authors
Affiliations

Giulia Mazzola

Newcastle University

Guglielmo Inglese

Università di Torino

GitHub repository

Replication materials for: Mazzola & Inglese, “Causalness, frequency, diachrony and the causal-noncausal alternation in Italian and Spanish” in Corpus Linguistics and Linguistic Theory (Accepted December 2025). Full citation information will be released after publication.

#| label: setup-packages
#| echo: false
#| message: false
#| warning: false
#| results: "hide"

packages <- c(
  "tidyverse", "knitr", "data.table", "DT", "here", "kableExtra",
  "lme4", "openxlsx", "readxl", "wesanderson", "stringr",
  "itsadug", "mgcv", "report", "gratia", "ggrepel", "gridExtra"
)

installed_packages <- rownames(installed.packages())
to_install <- setdiff(packages, installed_packages)

if (length(to_install) > 0) {
  suppressMessages(install.packages(to_install))
}


invisible(lapply(packages, function(pkg) {
  suppressMessages(suppressWarnings(library(pkg, character.only = TRUE)))
}))

# Set the working directory to where this Rmd file lives
setwd(here::here())

# Load data
romall <- readxl::read_excel("romall_new_pub.xlsx")

nc <- readxl::read_excel("noncaus_new_pub.xlsx")

Part I: Calculating the causalness degree over time

Cross-linguistic work on the causal-noncausal alternation has shown a strong correlation between the frequency with which a given verb is used in causal vs. noncausal contexts and its preferred coding pattern for the alternation: verbs that typically occur in causal contexts preferentially select the anticausative pattern (Haspelmath et al. 2014). These findings rest on the assumption that the correlation between frequency and coding is diachronic in nature, but this assumption has so far not been tested on historical data.

The data for this study come from diachronic corpora of Italian (MIDIA, D’Achille & Iacobini 2022) and Spanish (CDH, Real Academia Española 2013), from which the tokens of the 20 verb meanings used in Haspelmath et al. (2014) have been extracted and richly annotated. Each token has been classified according to whether it occurs in a causal or a noncausal context (following Haspelmath et al. 2014: 603, we considered passive, impersonal, reflexive, and reciprocal usages as causal) and based on the type of marking, that is, zero marking, marked anticausative (mainly with the reflexive marker si/se), or marked causative (regardless of the type of auxiliary verb, e.g., fare/hacer ‘make’, lasciare/dejar ‘let’, etc.).

Our dataset romall includes causal and noncausal observations. The subset nc only contains noncausal observations and is restricted to “variable contexts”, i.e., verb lemmas that alternate between zero and anticausative marking, since from a variationist point of view it is important to look only at contexts where the variation actually occurs.
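A few quick sanity checks can make these assumptions explicit (a minimal sketch; it assumes the column names used throughout this script, i.e., semantics with values caus/noncaus and coding with values zero/antic/caus):

# Sanity checks on the loaded datasets (column names as assumed above)
table(romall$language, romall$semantics)    # causal vs. noncausal tokens per language
stopifnot(all(nc$semantics == "noncaus"))   # nc should contain noncausal tokens only

# Inspect coding variation by lemma in nc (variable contexts should show
# both zero and anticausative marking):
nc %>%
  dplyr::count(lemma, coding) %>%
  tidyr::pivot_wider(names_from = coding, values_from = n, values_fill = 0)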

If we zoom in on the panchronic percentage of noncausal usage of the verbs and plot it against the spontaneity scale (or causative prominence) proposed by Haspelmath (1993), we can observe that Italian and Spanish behave substantially like the other languages in the sample studied by Haspelmath et al. (2014). Although the data in the plot below display the expected trend, the results may be problematic in that the percentage of noncausal usages has been calculated over the entire Italian and Spanish datasets. The problem is that, since the data come from diachronic corpora, one cannot be sure of the accuracy of this measurement, as the causalness degree may in fact be variable over time.

## Script for Figure 2

result_noncaus_meaning <- romall %>%
  group_by(language, meaning) %>%
  summarise(noncaus = sum(semantics == 'noncaus') * 100 / n())

plot_noncaus_tot <- ggplot(result_noncaus_meaning, aes(x = meaning, y = noncaus, color = language, group = language)) +
  geom_smooth(method = 'loess', se = FALSE) +
  labs(y = "% noncausal usage", x = "Causative prominence") +
  scale_color_manual(
    values = c("italian" = "gray70", "spanish" = "gray20"),
    labels = c("italian" = "Italian", "spanish" = "Spanish")) +
  coord_cartesian(ylim = c(1, 100)) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = NA),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)  # Rotate x-axis labels
  )

custom_order <- c("split", "close", "break", "open", "gather", "connect", "rock", "improve", "rise", "fill", "burn", "turn", "stop", "melt", "sink", "go out", "wake up", "dry", "boil")
plot_noncaus_tot <- plot_noncaus_tot + scale_x_discrete(limits = custom_order)
print(plot_noncaus_tot)

Overall purpose

We would like to include a new predictor in our dataset, namely the proportion of causal transitive uses (vs. noncausal uses) of each verb by time, which we define as the causalness degree. In order to do so we need to come up with a periodisation for our data (spanning 1140 to 2001), which allows us to calculate the relative frequency of causal uses by time. A visual survey of the data can be instructive for establishing a periodisation, but it poses the problem of the analyst’s bias, as “different researchers may arrive at different groupings even for the same data set” (Gries & Hilpert 2008: 2).
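In our notation, the causalness degree of a verb lemma $\ell$ in period $p$ is simply the share of its causal tokens:

$$
\mathrm{causalness}(\ell, p) = \frac{n_{\mathrm{caus}}(\ell, p)}{n_{\mathrm{caus}}(\ell, p) + n_{\mathrm{noncaus}}(\ell, p)}
$$

i.e., a value between 0 and 1 computed over all tokens of $\ell$ attested in $p$.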

Variability-based Neighbor Clustering (VNC)

Gries & Hilpert (2008) propose a more principled method for arriving at a meaningful periodisation of a given dataset: Variability-based Neighbor Clustering (VNC). It is similar to hierarchical cluster analysis (HCA), i.e., a method that displays a hierarchy of clusters, typically in the form of dendrograms with branches and leaves. Unlike HCA, VNC does justice to the chronological linearity of linguistic developments by only merging adjacent periods, and it identifies significant differences between periods based on the standard deviation, which is taken as the similarity measure, with averaging as the amalgamation rule.
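To make the procedure concrete, here is a minimal sketch of the amalgamation logic (illustrative only: the analyses below use Gries & Hilpert’s own vnc.individual() function, whose exact similarity computation may differ):

# Illustrative sketch of VNC-style amalgamation: at each step, merge the
# pair of *adjacent* periods whose pooled values have the smallest standard
# deviation, and record that distance (the basis for the screeplot).
vnc_sketch <- function(values, labels) {
  clusters <- as.list(values)          # one cluster per period, in chronological order
  names(clusters) <- labels
  merge_distances <- numeric(0)
  while (length(clusters) > 1) {
    sds <- vapply(
      seq_len(length(clusters) - 1),
      function(i) sd(c(clusters[[i]], clusters[[i + 1]])),
      numeric(1)
    )
    i <- which.min(sds)                # most similar adjacent pair
    merge_distances <- c(merge_distances, sds[i])
    names(clusters)[i] <- paste(names(clusters)[i], names(clusters)[i + 1], sep = "+")
    clusters[[i]] <- c(clusters[[i]], clusters[[i + 1]])
    clusters[[i + 1]] <- NULL
  }
  merge_distances
}
# e.g., on the Italian input table built below: vnc_sketch(table_ita$input, table_ita$year)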

The analyst inputs the proportions of a given variant (in our case, causal uses) per period, ideally decades. The choice of period size is somewhat arbitrary, but experimenting with different sizes helps to find the right balance between data sparsity and the significance of the VNC results. Because data are scarce at various points in the diachrony of our dataset (see the histogram below), we started with periods of 20 years and ended up with periods of 70 years. The VNC periodisation based on 70-year periods gives the best results in statistical terms, as the standard deviation measures for the different periods are large enough to show significant differences. We illustrate the procedure below.

ggplot(romall, aes(x = year, fill = semantics)) +
  geom_histogram(alpha = 0.5, position = "stack") +
  facet_wrap(~ stringr::str_to_title(language)) +
  scale_fill_manual(labels = c("Causal", "Noncausal"), values = c("#fc8961", "#b73779")) +
  theme_minimal()

Establish the 70-year periodisation

  • Step 1: Split the dataset into Italian and Spanish subsets.
# Step 1
esp <- filter(romall, language=="spanish")
ita <- filter(romall, language=="italian")

The range of years in the Italian dataset is 1200–1968, and in the Spanish dataset 1140–2001.
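These ranges can be checked directly:

# Year ranges per language subset (output as reported above)
range(ita$year)  # 1200 1968
range(esp$year)  # 1140 2001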

  • Step 2: Create a new variable grouping the years into 70-year periods. Periods are labelled with the central year of the 70-year span, e.g., 1175 stands for the period between 1140 and 1210.
# Step 2
esp<-esp %>% 
  mutate(
    period70_es=case_when(
      year<1210 ~ "1175",
      year<1280 ~ "1245",
      year<1350 ~ "1315", 
      year<1420 ~ "1385", 
      year<1490 ~ "1455",
      year<1560 ~ "1525",
      year<1630 ~ "1595",
      year<1700 ~ "1665",
      year<1770 ~ "1735",
      year<1840 ~ "1805",
      year<1910 ~ "1875",
      year<1980 ~ "1945",
      year<=2001 ~ "1990"
    )
  )

ita<-ita %>% 
  mutate(
    period70_ita=case_when(
      year<1270 ~ "1235",
      year<1340 ~ "1305",
      year<1410 ~ "1375", 
      year<1480 ~ "1445", 
      year<1550 ~ "1515",
      year<1620 ~ "1585",
      year<1690 ~ "1655",
      year<1760 ~ "1725",
      year<1830 ~ "1795",
      year<1900 ~ "1865",
      year<1970 ~ "1935"
    )
  )
  • Step 3: Create a table with the proportions of causal and noncausal uses by 70-year periods
# Step 3

es_perctable70<-table(esp$period70_es,esp$semantics)
es_perctable70<-prop.table(es_perctable70,1)*100
es_perctable70<-as.data.frame(es_perctable70)

ita_perctable70<-table(ita$period70_ita,ita$semantics)
ita_perctable70<-prop.table(ita_perctable70,1)*100 
ita_perctable70<-as.data.frame(ita_perctable70)

Apply the Variability-based Neighbor Clustering (VNC)

  • Step 1: read the VNC instructions and download the VNC file at this link

  • Step 2: Rename the columns following the instructions from the website

# Step 2

es_perctable70<-filter(es_perctable70, Var2== "caus") %>% 
  rename(year = Var1,input = Freq) %>% 
  select(year, input) 

ita_perctable70<-filter(ita_perctable70, Var2== "caus") %>% 
  rename(year = Var1,input = Freq) %>% 
  select(year, input) 

# round the proportions to one decimal place
table_ita<-ita_perctable70
table_ita$input<-round(table_ita$input, digits = 1) 

table_es<-es_perctable70
table_es$input<-round(table_es$input, digits = 1) 
  • Step 3: visualise the input table for the VNC and write it to a txt file
#Step 3

table_ita %>% 
  kbl(booktabs = TRUE, col.names = c(paste0("70-year periods", footnote_marker_symbol(1)), "% Causal uses - Italian"), escape = F, caption = "Input table for the VNC") %>% 
  kable_styling() %>% 
  footnote(symbol = "Periods are indicated with the central year of the 70-year span, e.g., 1175 stands for the period between 1140 and 1210")
Input table for the VNC
70-year periods*   % Causal uses - Italian
1235               61.7
1305               63.7
1375               63.5
1445               70.6
1515               73.6
1585               67.5
1655               61.4
1725               66.7
1795               63.0
1865               61.8
1935               60.4
* Periods are indicated with the central year of the 70-year span, e.g., 1175 stands for the period between 1140 and 1210
table_es %>% 
  kbl(booktabs = TRUE, col.names = c(paste0("70-year periods", footnote_marker_symbol(1)), "% Causal uses - Spanish"), escape = F, caption = "Input table for the VNC") %>% 
  kable_styling() %>% 
  footnote(symbol = "Periods are indicated with the central year of the 70-year span, e.g., 1175 stands for the period between 1140 and 1210")
Input table for the VNC
70-year periods*   % Causal uses - Spanish
1175               57.1
1245               53.7
1315               58.5
1385               73.0
1455               56.5
1525               56.6
1595               55.2
1665               62.8
1735               49.6
1805               51.7
1875               50.5
1945               52.0
1990               57.6
* Periods are indicated with the central year of the 70-year span, e.g., 1175 stands for the period between 1140 and 1210
setcolorder(table_es, c("input", "year"))
write.table(table_es, file = "esp_vnc_caus_70y_p.txt", sep = "\t", quote = FALSE, row.names = F)

setcolorder(table_ita, c("input", "year"))
write.table(table_ita, file = "ita_vnc_caus_70y_p.txt", sep = "\t", quote = FALSE, row.names = F)
  • Step 4: enter vnc.individual(file.choose()) and, when prompted to choose/enter a file name, choose/enter the path to the file with the data to be analyzed on the hard drive (e.g., newdata.txt). Run this in a separate R script and follow the instructions on the console. The separate script is available as VNC_apr25.R in this repository.
# do not run

load("vnc.individual.RData")

vnc.individual("esp_vnc_caus_70y_p.txt")

vnc.individual("ita_vnc_caus_70y_p.txt")

Running the previous script will generate two graphs in separate windows: a dendrogram and a screeplot of the type shown below.

The dendrogram specifies how the periods are clustered (the most similar periods are amalgamated first). The analyst should assess how many developmental stages are relevant to the diachronic study. The screeplot produced by the VNC function supports this decision by comparing the distances between successive mergers (measured in standard deviations). Essentially, one reads the slope from left to right and takes the number of clusters with the largest differences, until the slope levels off (i.e., until the difference between clusters is marginal). In this case 4 or 6 clusters seem adequate. Given that we want to minimise data sparsity, we make the most parsimonious choice and select the 4-cluster solution.

Dendrogram Italian / Scree plot Italian

Dendrogram Spanish / Scree plot Spanish
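To make the “levels off” criterion concrete, here is a toy illustration with hypothetical merge distances (not our actual VNC output):

# Hypothetical screeplot values: distance (in sd units) at which each
# additional cluster is split off, ordered from few to many clusters.
d <- c(5.1, 3.8, 2.6, 1.9, 0.6, 0.5, 0.4)
diff(d)  # the drops remain sizable up to the 4th value and then become
         # marginal, which would point to a 4-cluster solution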

From the interpretation of the VNC results we end up with 4 significant periods for the development of the causal vs. noncausal alternation in Spanish, dividing our dataset into the following periods: 1140-1349, 1350-1419, 1420-1699, 1700-2001. Data are distributed across the periods as summarised below. In Spanish, causal uses are consistently more frequent than noncausal ones, but always below 60% except in the second period, where they reach 73%.

esp<-esp %>% 
  mutate(
    es_vnc70_apr25=case_when(
      year<1350 ~ "1140-1349",
      year<1420 ~ "1350-1419",
      year<1700 ~ "1420-1699",
      year<2002 ~ "1700-2001"
    )
  )

summary_es<-esp %>%
  group_by (es_vnc70_apr25, semantics) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))

summary_es %>%  kbl(booktabs = TRUE, col.names = c("VNC periods", "Semantics", "Count", "Relative Frequency (%)"), caption = "Spanish - Summary of the verb semantics by VNC periods") %>%
   collapse_rows(column = 1) %>% 
  kable_styling() 
Spanish - Summary of the verb semantics by VNC periods
VNC periods   Semantics   Count   Relative Frequency (%)
1140-1349     caus          434   55%
1140-1349     noncaus       348   45%
1350-1419     caus           92   73%
1350-1419     noncaus        34   27%
1420-1699     caus         1367   57%
1420-1699     noncaus      1044   43%
1700-2001     caus         1653   52%
1700-2001     noncaus      1523   48%

For Italian, the 4 relevant periods indicated by the dendrogram are 1200-1409, 1410-1549, 1550-1619 and 1620-1968. Data are distributed across the periods as summarised below. In Italian, causal uses are consistently more frequent than noncausal ones, always above 60%. Their relative frequency is at its highest in the second period, when it reaches 73%, as in Spanish. However, the second VNC period in Italian is about a century later than in Spanish.

ita<-ita %>% 
  mutate(
    ita_vnc70_apr25=case_when(
      year<1410 ~ "1200-1409",
      year<1550 ~ "1410-1549",
      year<1620 ~ "1550-1619",
      year<=1970 ~ "1620-1968"
    )
  )

summary_ita<-ita %>%
  group_by (ita_vnc70_apr25, semantics) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))

summary_ita %>%  kbl(booktabs = TRUE, col.names = c("VNC periods", "Semantics", "Count", "Relative Frequency (%)"), caption = "Italian - Summary of the verb semantics by VNC periods") %>%
   collapse_rows(column = 1) %>% 
  kable_styling() 
Italian - Summary of the verb semantics by VNC periods
VNC periods   Semantics   Count   Relative Frequency (%)
1200-1409     caus          768   64%
1200-1409     noncaus       441   36%
1410-1549     caus         1135   73%
1410-1549     noncaus       421   27%
1550-1619     caus          695   67%
1550-1619     noncaus       335   33%
1620-1968     caus         2700   62%
1620-1968     noncaus      1635   38%

Calculate the Causalness Degree

Having established the significant periods for the causal/noncausal alternation, we calculate the share of causal uses by verb lemma and by period. That is, every verb lemma in each of the four periods is assigned a number between 0 and 1, indicating the proportion of causal uses vs. noncausal uses. The table below summarises the distribution of causal uses by lemma and period (variable caus_use).

Italian:

caus_use_ita<-ita %>%
  group_by (ita_vnc70_apr25, lemma, semantics) %>%
  summarise (n=n()) %>%
  mutate(caus_use_ita_apr25 = n/sum(n))


caus_use_ita<-caus_use_ita %>% filter(semantics=="caus") %>% ungroup() %>% select (-semantics)

caus_use_ita %>% DT::datatable(rownames=F, filter="top")

Spanish:

caus_use_es <-esp %>%
  group_by (es_vnc70_apr25, lemma, semantics) %>%
  summarise (n=n()) %>%
  mutate(caus_use_es_apr25 = n/sum(n))


caus_use_es<-caus_use_es %>% filter(semantics=="caus") %>% ungroup() %>% select (-semantics)

caus_use_es %>% DT::datatable(rownames=F, filter="top")

Finally, we rejoin the new language-specific information (70-year period, VNC period, and causalness degree) to the main dataset:

# 1. Join caus_use_ita to ita by vnc and lemma
ita <- ita %>%
  left_join(caus_use_ita, by = c("ita_vnc70_apr25" = "ita_vnc70_apr25", "lemma" = "lemma"))

# 2. Join caus_use_es to esp by vnc and lemma
esp <- esp %>%
  left_join(caus_use_es, by = c("es_vnc70_apr25" = "es_vnc70_apr25", "lemma" = "lemma"))

# Check for duplicated column names after the join (should print character(0))
names(ita)[duplicated(names(ita))]
character(0)
# 3. Rename columns to match target names before merging into romall
ita <- ita %>%
  select(-any_of(c("vnc_period_apr25", "causalness_degree_apr25"))) %>%
  rename(
    vnc_period_apr25 = ita_vnc70_apr25,
    causalness_degree_apr25 = caus_use_ita_apr25
  )

esp <- esp %>%
  select(-any_of(c("vnc_period_apr25", "causalness_degree_apr25"))) %>%
  rename(
    vnc_period_apr25 = es_vnc70_apr25,
    causalness_degree_apr25 = caus_use_es_apr25
  )

# 4. Bind the two enriched subsets together
combined <- bind_rows(ita, esp)

# 5. Join back to romall by id and lemma
romall <- romall %>%
  select(-any_of(c("vnc_period_apr25", "causalness_degree_apr25"))) %>%
  left_join(combined %>% select(id, lemma, vnc_period_apr25, causalness_degree_apr25),
            by = c("id", "lemma"))


nc <- nc %>%
  select(-any_of(c("vnc_period_apr25", "causalness_degree_apr25"))) %>%
  left_join(combined %>% select(id, lemma, vnc_period_apr25, causalness_degree_apr25),
            by = c("id", "lemma"))
#only run this if you want to save the results of the VNC in a new xlsx file

openxlsx::write.xlsx(romall, "romall_new_VNC.xlsx")
openxlsx::write.xlsx(nc, "noncaus_new_VNC.xlsx")

Diachronic analysis of the noncausal usage and causalness degree correlation

After calculating the diachronic causalness degree, we follow the procedure outlined in Heidinger (2015): for each of the four periods in each language, we measure the correlation between the verbs’ causalness degree and their percentage of anticausative marking. The results are shown in Figure 6 and Figure 7:

##Calculate causalness degree by language and period

caus_degree_ita <- romall %>%
  filter (language == 'italian') %>%
  group_by(meaning, vnc_period_apr25) %>%
  summarise(caus_degree = sum(semantics == 'caus') * 100 / n())

caus_degree_es <- romall %>%
  filter (language == 'spanish') %>%
  group_by(meaning, vnc_period_apr25) %>%
  summarise(caus_degree = sum(semantics == 'caus') * 100 / n())

##Calculate percentage of anticausative marking by language and period

coding_percentages_ita <- romall %>%
  group_by(meaning, semantics, coding, vnc_period_apr25) %>%
  filter (language == 'italian') %>%
  summarise(count = n()) %>%
  spread(coding, count, fill = 0) %>%
  mutate(caus_degree_vs_zero = caus * 100 / (caus + zero),
         prop_antic_vs_zero = antic * 100 / (antic + zero))
  

coding_percentages_es <- romall %>%
  group_by(meaning, semantics, coding, vnc_period_apr25) %>%
  filter (language == 'spanish') %>%
  summarise(count = n()) %>%
  spread(coding, count, fill = 0) %>%
  mutate(caus_degree_vs_zero = caus * 100 / (caus + zero),
         prop_antic_vs_zero = antic * 100 / (antic + zero))

## Scatter plots of correlations between causalness degree and anticausative marking by period and language

merged_data_ita <- merge(caus_degree_ita, coding_percentages_ita, by = c("meaning", "vnc_period_apr25"))

merged_data_ita_noncaus <- merged_data_ita %>%
  filter(semantics == 'noncaus')

merged_data_ita_noncaus_1 <- merged_data_ita_noncaus %>%
  filter(vnc_period_apr25 == '1200-1409')

merged_data_ita_noncaus_2 <- merged_data_ita_noncaus %>%
  filter(vnc_period_apr25 == '1410-1549')

merged_data_ita_noncaus_3 <- merged_data_ita_noncaus %>%
  filter(vnc_period_apr25 == '1550-1619')

merged_data_ita_noncaus_4 <- merged_data_ita_noncaus %>%
  filter(vnc_period_apr25 == '1620-1968')

merged_data_es <- merge(caus_degree_es, coding_percentages_es, by = c("meaning", "vnc_period_apr25"))

merged_data_es_noncaus <- merged_data_es %>%
  filter(semantics == 'noncaus')

merged_data_es_noncaus_1 <- merged_data_es_noncaus %>%
  filter(vnc_period_apr25 == '1140-1349')

merged_data_es_noncaus_2 <- merged_data_es_noncaus %>%
  filter(vnc_period_apr25 == '1350-1419')

merged_data_es_noncaus_3 <- merged_data_es_noncaus %>%
  filter(vnc_period_apr25 == '1420-1699')

merged_data_es_noncaus_4 <- merged_data_es_noncaus %>%
  filter(vnc_period_apr25 == '1700-2001')

## Script for Figure 6
plot_ita_1 <- ggplot(merged_data_ita_noncaus_1, aes(x = caus_degree, y = prop_antic_vs_zero, label = meaning)) +
  geom_point() +
  labs(x = "Causalness",
       y = "%anticausative", title = "1200-1409") +
  theme_minimal() +
  theme(text = element_text(size = 14))

plot_ita_2 <- ggplot(merged_data_ita_noncaus_2, aes(x = caus_degree, y = prop_antic_vs_zero, label = meaning)) +
  geom_point() +
  labs(x = "Causalness",
       y = "%anticausative", title = "1410-1549") +
  theme_minimal() +
  theme(text = element_text(size = 14))

plot_ita_3 <- ggplot(merged_data_ita_noncaus_3, aes(x = caus_degree, y = prop_antic_vs_zero, label = meaning)) +
  geom_point() +
  labs(x = "Causalness",
       y = "%anticausative", title = "1550-1619") +
  theme_minimal() +
  theme(text = element_text(size = 14))

plot_ita_4 <- ggplot(merged_data_ita_noncaus_4, aes(x = caus_degree, y = prop_antic_vs_zero, label = meaning)) +
  geom_point() +
  labs(x = "Causalness",
       y = "%anticausative", title = "1620-1968") +
  theme_minimal() +
  theme(text = element_text(size = 14))

grid.arrange(plot_ita_1, plot_ita_2, plot_ita_3, plot_ita_4, nrow = 2, ncol = 2,  top = "Italian Correlation Plot")

##Script for Figure 7
plot_es_1 <- ggplot(merged_data_es_noncaus_1, aes(x = caus_degree, y = prop_antic_vs_zero, label = meaning)) +
  geom_point() +
  labs(x = "Causalness",
       y = "%anticausative", title = "1140-1349") +
  theme_minimal() +
  theme(text = element_text(size = 14))

plot_es_2 <- ggplot(merged_data_es_noncaus_2, aes(x = caus_degree, y = prop_antic_vs_zero, label = meaning)) +
  geom_point() +
  labs(x = "Causalness",
       y = "%anticausative", title = "1350-1419") +
  theme_minimal() +
  theme(text = element_text(size = 14))

plot_es_3 <- ggplot(merged_data_es_noncaus_3, aes(x = caus_degree, y = prop_antic_vs_zero, label = meaning)) +
  geom_point() +
  labs(x = "Causalness",
       y = "%anticausative", title = "1420-1699") +
  theme_minimal() +
  theme(text = element_text(size = 14))

plot_es_4 <- ggplot(merged_data_es_noncaus_4, aes(x = caus_degree, y = prop_antic_vs_zero, label = meaning)) +
  geom_point() +
  labs(x = "Causalness",
       y = "%anticausative", title = "1700-2001") +
  theme_minimal() +
  theme(text = element_text(size = 14))

grid.arrange(plot_es_1, plot_es_2, plot_es_3, plot_es_4, nrow = 2, ncol = 2,  top = "Spanish Correlation Plot")

The corresponding correlation measures (Spearman Correlation Coefficient, SCC), along with their p-values, are given for each period in the tables below:

## Function to format the p-values
format_pval <- function(p) {
  case_when(
    p < 0.001 ~ "< 0.001***",
    p < 0.01  ~ "< 0.01**",
    p < 0.05  ~ "< 0.05*",
    TRUE      ~ paste0("= ", round(p, 2))
  )
}

# Italian

## Individual period datasets for Italian
datasets <- list(
  Period1 = merged_data_ita_noncaus_1,
  Period2 = merged_data_ita_noncaus_2,
  Period3 = merged_data_ita_noncaus_3,
  Period4 = merged_data_ita_noncaus_4
)

## Make a dataframe with the Spearman correlation and p-values
cor_table_ita <- purrr::imap_dfr(datasets, function(df, period) {
  test_result <- cor.test(df$caus_degree, df$prop_antic_vs_zero, method = "spearman", exact = FALSE)
  
  tibble(
    Period = period,
    Spearman_Correlation = round(test_result$estimate, 2),
    P_value = format_pval(test_result$p.value)
  )
})

## Show with kableExtra
cor_table_ita %>%
  kable("html", caption = "Italian - Spearman Correlation Between Causalness Degree and Anticausative Usage by Period") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Italian - Spearman Correlation Between Causalness Degree and Anticausative Usage by Period
Period Spearman_Correlation P_value
Period1 0.28 = 0.24
Period2 0.79 < 0.001***
Period3 0.58 < 0.01**
Period4 0.59 < 0.01**
# Spanish

## Individual period datasets for Spanish
datasets_es <- list(
  Period1 = merged_data_es_noncaus_1,
  Period2 = merged_data_es_noncaus_2,
  Period3 = merged_data_es_noncaus_3,
  Period4 = merged_data_es_noncaus_4
)

## Make a dataframe with the Spearman correlation and p-values
cor_table_es <- purrr::imap_dfr(datasets_es, function(df, period) {
  test_result <- cor.test(df$caus_degree, df$prop_antic_vs_zero, method = "spearman", exact = FALSE)
  
  tibble(
    Period = period,
    Spearman_Correlation = round(test_result$estimate, 2),
    P_value = format_pval(test_result$p.value)
  )
})

## Show with kableExtra
cor_table_es %>%
  kable("html", caption = "Spearman Correlation Between Causalness and Anticausative Usage by Period (Spanish)") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Spearman Correlation Between Causalness and Anticausative Usage by Period (Spanish)
Period Spearman_Correlation P_value
Period1 0.36 = 0.22
Period2 0.43 = 0.19
Period3 0.04 = 0.88
Period4 0.15 = 0.52

Lexical or semantic effects?

The correlations discussed above are based on verb meaning, but we maintain that it is important to also discuss lemma effects.

In particular, if the degree of causalness were a direct consequence of the degree of spontaneity of the event (Heidinger 2015), we would expect verbs that lexicalize the same event (and therefore presumably share the same degree of spontaneity) to have substantially the same causalness degree. We verified whether this is the case by looking at the Italian data for period IV.

# Step 1: Calculate causal degree for Italian lemmas
caus_degree_ita_lemma <- romall %>%
  filter(language == "italian") %>%
  group_by(lemma, vnc_period_apr25) %>%
  summarise(caus_degree = sum(semantics == "caus") * 100 / n(), .groups = "drop")

# Step 2: Calculate coding percentages
coding_percentages_ita_lemma <- romall %>%
  filter(language == "italian") %>%
  group_by(lemma, semantics, coding, vnc_period_apr25) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = coding, values_from = count, values_fill = 0) %>%
  mutate(
    caus_degree_vs_zero = caus * 100 / (caus + zero),
    prop_antic_vs_zero = antic * 100 / (antic + zero)
  )

# Step 3: Merge the datasets
merged_data_ita_lemma <- left_join(
  caus_degree_ita_lemma,
  coding_percentages_ita_lemma,
  by = c("lemma", "vnc_period_apr25")
)

# Step 4: Filter for noncausal semantics and the last time period
merged_data_ita_noncaus_4_lemma <- merged_data_ita_lemma %>%
  filter(semantics == "noncaus", vnc_period_apr25 == "1620-1968")

# Step 5: Create verb meaning groups
groupings <- list(
  burn = c("ardere", "bruciare"),
  fill = c("empiere", "riempire"),
  melt = c("sciogliere", "fondere"),
  rise = c("alzare", "sollevare"),
  rock = c("dondolare", "oscillare"),
  split = c("dividere", "separare"),
  stop = c("arrestare", "fermare")
  )

# Assign group labels
merged_data_ita_noncaus_4_lemma <- merged_data_ita_noncaus_4_lemma %>%
  mutate(group = map_chr(lemma, function(l) {
    matched_group <- names(keep(groupings, ~ l %in% .x))
    if (length(matched_group) > 0) matched_group else NA_character_
  })) %>%
  filter(!is.na(group)) %>%
  mutate(group = factor(group, levels = names(groupings)))

# Step 6: Plot
ggplot(merged_data_ita_noncaus_4_lemma, aes(
  x = caus_degree,
  y = prop_antic_vs_zero,
  label = lemma,
  color = group,
  fill = group
)) +
  geom_point() +
  geom_text_repel(
    data = merged_data_ita_noncaus_4_lemma %>%
      filter(lemma %in% unlist(groupings)),
    aes(color = group),
    hjust = 0.5,
    vjust = -0.5,
    size = 4
  ) +
  labs(
    x = "Causalness",
    y = "% Anticausative"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 14)) +
  scale_color_manual(
    name = "Meaning",
    values = c(
      burn = "red", fill = "blue", melt = "purple", rise = "brown",
      rock = "green", split = "cyan", stop = "orange"
    )
  ) +
  scale_fill_manual(
    name = "Meaning",
    values = c(
      burn = "red", fill = "blue", melt = "purple", rise = "brown",
      rock = "green", split = "cyan", stop = "orange"
    )
  ) +
  guides(color = guide_legend(override.aes = list(shape = 16, size = 4)))

Part II: Frequency effects of causal degree on anticausative marking over time

# Only noncausal tokens; verb lemmas with categorical coding selection removed

ita_nc<-nc %>% filter(language=="italian")
es_nc<-nc %>% filter(language=="spanish")

ita_nc<-ita_nc %>% select(id, coding, meaning, lemma, vnc_period_apr25, causalness_degree_apr25)

ita_nc<-ita_nc %>% rename(period=vnc_period_apr25, causalness_degree=causalness_degree_apr25)

es_nc<-es_nc %>% select(id, coding, meaning, lemma, vnc_period_apr25, causalness_degree_apr25)

es_nc<-es_nc %>% rename(period=vnc_period_apr25, causalness_degree=causalness_degree_apr25)

Italian data: prepare for Beta Regression

Subset the Italian data to obtain unique combinations of lemma and period.

  • Step 1: Aggregate data to calculate the proportion of anticausative codings
ita_agg <- ita_nc %>%
  group_by(lemma, period) %>%
  summarise(
    anticausative_count = sum(coding == "antic"),
    total_count = n(),
    proportion_anticausative = anticausative_count / total_count,
    causalness_degree = first(causalness_degree)
  ) %>%
  ungroup()
  • Step 2: Create a lagged causalness_degree variable
ita_agg <- ita_agg %>%
  arrange(lemma, period) %>%
  group_by(lemma) %>%
  mutate(causalness_lagged = lag(causalness_degree)) %>%
  ungroup() %>%
  # Remove rows without previous causalness_degree (period 1)
  filter(!is.na(causalness_lagged))
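
To see what the lagging step does, here is a toy example (hypothetical numbers): within each lemma, lag() shifts the causalness degree down by one period, so each row carries the causalness of the preceding period; period-1 rows get NA and are dropped above.

# Toy illustration of the lagging step (hypothetical values)
tibble(
  lemma = "rompere",
  period = c("1410-1549", "1550-1619", "1620-1968"),
  causalness_degree = c(0.90, 0.90, 0.82)
) %>%
  mutate(causalness_lagged = lag(causalness_degree))
# causalness_lagged: NA, 0.90, 0.90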

Spanish data: prepare for Beta Regression

Subset the Spanish data to obtain unique combinations of lemma and period.

  • Step 1: Aggregate data to calculate the proportion of anticausative codings:
es_agg <- es_nc %>%
  group_by(lemma, period) %>%
  summarise(
    anticausative_count = sum(coding == "antic"),
    total_count = n(),
    proportion_anticausative = anticausative_count / total_count,
    causalness_degree = first(causalness_degree)
  ) %>%
  ungroup()
  • Step 2: Create a lagged causalness_degree variable:
es_agg <- es_agg %>%
  arrange(lemma, period) %>%
  group_by(lemma) %>%
  mutate(causalness_lagged = lag(causalness_degree)) %>%
  ungroup() %>%
  # Remove rows without previous causalness_degree (period 1)
  filter(!is.na(causalness_lagged)) 

Fit the Beta Regression model

Prepare the data for the regression model

  • Step 1: join the Italian and Spanish sub-datasets
  • Step 2: convert the categorical variables into factors; transform period into a numeric variable so that it can be used as a smooth term
  • Step 3: handle proportions for beta regression and adjust the numeric variables

In our data, the dependent variable proportion_anticausative takes values between 0 and 1. However, some observations are exactly 0 or 1. This poses a problem for beta regression (and its GAM extension with family = betar), because the beta distribution is only defined on the open interval (0,1).
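
For reference, a common mean-precision parameterisation of the beta density (to the best of our understanding, the one underlying mgcv’s betar, with the precision parameter $\theta$ estimated from the data) is

$$
f(y; \mu, \theta) = \frac{\Gamma(\theta)}{\Gamma(\mu\theta)\,\Gamma\big((1-\mu)\theta\big)}\, y^{\mu\theta - 1} (1-y)^{(1-\mu)\theta - 1}, \qquad 0 < y < 1,
$$

which is undefined at $y = 0$ and $y = 1$; hence the adjustment described next.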

To address this, we applied a small adjustment:

We created a copy of the original variable (proportion_anticausative_notadj) to preserve the raw values, including 0s and 1s.

We replaced the original proportion_anticausative with a slightly modified version where values equal to 0 are nudged to a very small positive number (ε = 0.000001), and values equal to 1 are nudged to just below 1 (1 – ε).

This ensures that all proportions are strictly within (0,1), making them suitable for beta regression, while still preserving the original variable for transparency and reference.

# Step 1
## Add a column to each dataset to distinguish them
italian <- ita_agg %>%
  mutate(language = "Italian")

spanish <- es_agg %>%
  mutate(language = "Spanish")

## Combine the datasets using bind_rows()
join <- bind_rows(italian, spanish)
# Step 2: prepare factor variables

join$language <- as.factor(join$language)
join$period_cat <- as.factor(join$period)  # categorical copy of the period labels

# numeric version of period (1-4), usable as a smooth term
join$period <- as.numeric(join$period_cat)

# Step 3: adjust the numeric variables to avoid exact 0s and 1s in the dependent variable and to make NAs into 0s for the predictors


epsilon <- 1e-6  # small adjustment so values fall strictly in (0,1)

join <- join %>%
  mutate(
    causalness_degree = ifelse(is.na(causalness_degree), 0, causalness_degree),
    causalness_lagged = ifelse(is.na(causalness_lagged), 0, causalness_lagged),
    proportion_anticausative_notadj = proportion_anticausative,  # keep original
    proportion_anticausative = pmin(pmax(proportion_anticausative, epsilon), 1 - epsilon)  # adjusted
  )

# quick checks
range(join$proportion_anticausative_notadj, na.rm = TRUE)  # should still show 0 and 1
[1] 0 1
range(join$proportion_anticausative, na.rm = TRUE)         # should now be (0,1)
[1] 0.000001 0.999999
all(join$proportion_anticausative > 0 & join$proportion_anticausative < 1)  # should be TRUE
[1] TRUE

Below is a summary of the dataset, with one row per unique combination of lemma and period, giving the anticausative count and proportion together with the synchronic and lagged causalness degree:

# Summarize the dataset 

summary_table <- join %>%
  group_by(language, lemma, period_cat) %>%
  summarise(
    N_Antic = sum(anticausative_count, na.rm = TRUE),   # Total count of anticausatives
    N_Total = sum(total_count, na.rm = TRUE),             # Total count of all occurrences
    `%_Antic_adjusted` = proportion_anticausative * 100,           # adjusted proportion, as a percentage
    Causalness_Degree = causalness_degree,                # Use the original value
    Causalness_Lagged = causalness_lagged                 # Use the original value
  ) %>%
  ungroup()  # Remove grouping

summary_table %>%
  kable("html", caption = "Summary Table of Anticausative Proportions by Language, Lemma, and Period") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
  row_spec(0, bold = TRUE) %>%  # Optional: make the header row bold
  collapse_rows(columns = c(1, 2), valign = "top")  # Collapse the rows for language and lemma
Summary Table of Anticausative Proportions by Language, Lemma, and Period
language lemma period_cat N_Antic N_Total %_Antic_adjusted Causalness_Degree Causalness_Lagged
Italian affondare 1410-1549 0 1 0.000100 0.5000000 0.1250000
1550-1619 1 2 50.000000 0.3333333 0.5000000
1620-1968 5 21 23.809524 0.1600000 0.3333333
alzare 1410-1549 14 16 87.500000 0.8181818 0.8214286
1550-1619 32 32 99.999900 0.6404494 0.8181818
1620-1968 155 155 99.999900 0.5596591 0.6404494
ardere 1410-1549 4 95 4.210526 0.4946809 0.5161290
1550-1619 2 67 2.985075 0.3557692 0.4946809
1620-1968 0 86 0.000100 0.3722628 0.3557692
arrestare 1410-1549 3 3 99.999900 0.7857143 0.1111111
1550-1619 6 7 85.714286 0.5625000 0.7857143
1620-1968 101 104 97.115385 0.4057143 0.5625000
bruciare 1550-1619 1 3 33.333333 0.5000000 0.8750000
1620-1968 1 30 3.333333 0.6202532 0.5000000
chiudere 1410-1549 11 15 73.333333 0.8148148 0.6739130
1550-1619 4 4 99.999900 0.9090909 0.8148148
1620-1968 38 38 99.999900 0.8671329 0.9090909
dividere 1410-1549 14 14 99.999900 0.9034483 0.7826087
1550-1619 14 14 99.999900 0.8271605 0.9034483
1620-1968 51 51 99.999900 0.8131868 0.8271605
fermare 1410-1549 43 46 93.478261 0.4888889 0.8000000
1550-1619 35 35 99.999900 0.4262295 0.4888889
1620-1968 172 185 92.972973 0.2773438 0.4262295
fondere 1410-1549 1 1 99.999900 0.8333333 0.8235294
1550-1619 0 2 0.000100 0.0000000 0.8333333
gelare 1620-1968 7 31 22.580645 0.2954545 0.5000000
girare 1410-1549 4 34 11.764706 0.4137931 0.1125828
1550-1619 5 31 16.129032 0.4150943 0.4137931
1620-1968 16 146 10.958904 0.3209302 0.4150943
migliorare 1410-1549 0 6 0.000100 0.4545455 0.7692308
1550-1619 0 4 0.000100 0.6363636 0.4545455
1620-1968 2 23 8.695652 0.7500000 0.6363636
radunare 1410-1549 38 39 97.435897 0.5851064 0.6032609
1550-1619 15 15 99.999900 0.6052632 0.5851064
1620-1968 69 72 95.833333 0.3513514 0.6052632
riempire 1410-1549 5 5 99.999900 0.8148148 0.8333333
1550-1619 2 2 99.999900 0.8823529 0.8148148
1620-1968 13 13 99.999900 0.8859649 0.8823529
rompere 1410-1549 16 16 99.999900 0.9030303 0.8586957
1550-1619 6 6 99.999900 0.8983051 0.9030303
1620-1968 31 35 88.571429 0.8241206 0.8983051
sciogliere 1410-1549 7 7 99.999900 0.9078947 0.9393939
1550-1619 5 6 83.333333 0.9259259 0.9078947
1620-1968 66 67 98.507463 0.7803279 0.9259259
seccare 1410-1549 5 9 55.555556 0.4705882 0.7058824
1550-1619 4 5 80.000000 0.1666667 0.4705882
1620-1968 4 7 57.142857 0.3636364 0.1666667
spegnere 1410-1549 31 35 88.571429 0.7784810 0.6716418
1550-1619 2 4 50.000000 0.8918919 0.7784810
1620-1968 53 59 89.830508 0.6878307 0.8918919
svegliare 1410-1549 11 14 78.571429 0.5483871 0.3750000
1550-1619 7 8 87.500000 0.4666667 0.5483871
1620-1968 57 61 93.442623 0.5579710 0.4666667
unire 1410-1549 16 16 99.999900 0.5789474 0.8333333
1550-1619 38 38 99.999900 0.5128205 0.5789474
1620-1968 157 159 98.742138 0.5470085 0.5128205
Spanish abrir 1350-1419 1 1 99.999900 0.5000000 0.9074074
1420-1699 9 10 90.000000 0.9295775 0.5000000
1700-2001 26 27 96.296296 0.8732394 0.9295775
alzar 1350-1419 3 4 75.000000 0.7777778 0.5901639
1420-1699 32 39 82.051282 0.8088235 0.7777778
1700-2001 30 34 88.235294 0.6761905 0.8088235
apagar 1420-1699 22 23 95.652174 0.7012987 0.8750000
1700-2001 116 116 99.999900 0.6013746 0.7012987
cerrar 1420-1699 6 9 66.666667 0.9217391 0.8064516
1700-2001 15 18 83.333333 0.8875000 0.9217391
congelar 1700-2001 17 18 94.444444 0.2800000 0.4242424
derretir 1420-1699 49 52 94.230769 0.5737705 0.5000000
1700-2001 71 73 97.260274 0.3539823 0.5737705
despertar 1350-1419 2 5 40.000000 0.2857143 0.5416667
1420-1699 8 88 9.090909 0.5368421 0.2857143
1700-2001 43 109 39.449541 0.4953704 0.5368421
fundir 1420-1699 12 12 99.999900 0.5862069 0.7857143
1700-2001 24 24 99.999900 0.2941176 0.5862069
girar 1420-1699 1 15 6.666667 0.5000000 0.4000000
1700-2001 6 305 1.967213 0.2738095 0.5000000
hervir 1420-1699 4 148 2.702703 0.2043011 0.4400000
1700-2001 7 118 5.932203 0.1259259 0.2043011
hundir 1700-2001 99 100 99.000000 0.3630573 0.2448980
juntar 1350-1419 2 2 99.999900 0.5000000 0.6666667
1420-1699 157 160 98.125000 0.4539249 0.5000000
1700-2001 26 26 99.999900 0.5272727 0.4539249
levantar 1350-1419 4 4 99.999900 0.0000000 0.1803279
1700-2001 77 80 96.250000 0.5348837 0.3783784
llenar 1700-2001 27 28 96.428571 0.7971014 0.7000000
mejorar 1350-1419 2 5 40.000000 0.1666667 0.2287582
1420-1699 17 40 42.500000 0.5402299 0.1666667
1700-2001 9 48 18.750000 0.6220472 0.5402299
parar 1350-1419 2 2 99.999900 0.8181818 0.5775862
1420-1699 43 94 45.744681 0.3472222 0.8181818
1700-2001 16 38 42.105263 0.1363636 0.3472222
quemar 1350-1419 2 2 99.999900 0.9310345 0.8703704
1420-1699 38 45 84.444444 0.7867299 0.9310345
1700-2001 14 17 82.352941 0.7118644 0.7867299
romper 1420-1699 16 36 44.444444 0.7721519 0.7777778
1700-2001 19 52 36.538461 0.6645161 0.7721519
secar 1350-1419 2 4 50.000000 0.2000000 0.4333333
1420-1699 76 76 99.999900 0.2621359 0.2000000
1700-2001 28 29 96.551724 0.6233766 0.2621359

Model fit and model selection

Below we fit a beta regression model with mgcv::gam: model1 is the maximal model, which includes:

  • s(causalness_degree, k=10), s(causalness_lagged, k=10) and s(period, k=5): global smooth terms — these model the overall nonlinear effects of each predictor across all languages.

  • s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1), s(causalness_degree, by = language, k = 10, bs = "cs", m = 1), and s(period, by = language, k = 5, bs = "cs", m= 1): group-wise smooths (by language) — these allow the shape of the smooth to differ between Italian and Spanish.

  • family = betar(...): the model uses a beta regression family, designed for proportions or rates strictly between 0 and 1.

  • link = "logit": the logit link function ensures predicted values remain within the (0, 1) range.

  • method = "ML": the model is fitted using maximum likelihood estimation (ML) rather than the default restricted maximum likelihood (REML). ML is typically used when comparing models with different fixed-effect structures, since REML estimates are not directly comparable across models with different smooth terms.

model1 <- gam(
  proportion_anticausative ~ 
    s(causalness_lagged, k=10)+
    s(causalness_degree, k=10)+
    s(period, k=5)+
    s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1) +
    s(causalness_degree, by = language, k = 10, bs = "cs", m = 1) +
    s(period, by = language, k = 5, bs = "cs", m= 1),   
  family = betar(link = "logit"),                 
  data = join,
  method="ML"
)

# Summarize the model
summary(model1)

Family: Beta regression(0.536) 
Link function: logit 

Formula:
proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(period, k = 5) + s(causalness_lagged, by = language, 
    k = 10, bs = "cs", m = 1) + s(causalness_degree, by = language, 
    k = 10, bs = "cs", m = 1) + s(period, by = language, k = 5, 
    bs = "cs", m = 1)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7087     0.1304   5.435 5.49e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                           edf Ref.df Chi.sq p-value    
s(causalness_lagged)                 1.000e+00      1  1.171 0.27928    
s(causalness_degree)                 1.000e+00      1 11.195 0.00082 ***
s(period)                            1.000e+00      1  0.259 0.61071    
s(causalness_lagged):languageItalian 7.500e-06      9  0.000 0.58048    
s(causalness_lagged):languageSpanish 1.190e-04      9  0.000 0.59603    
s(causalness_degree):languageItalian 1.088e-05      9  0.000 0.18367    
s(causalness_degree):languageSpanish 1.299e+00      9  8.120 0.00261 ** 
s(period):languageItalian            7.832e-05      3  0.000 0.24791    
s(period):languageSpanish            1.844e-05      3  0.000 0.21472    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.148   Deviance explained = 26.7%
-ML = -302.44  Scale est. = 1         n = 99
  • Step 3: Backward selection

In model2 we start by removing s(period, k=5), the term with the highest p-value. For model comparison we use:

  • anova(..., test="Chisq"): compares two nested GAM models (e.g., model1 and the simpler model2) with a likelihood ratio (Chi-squared) test
  • AIC(): computes the AIC of the two models, a measure of fit that balances goodness-of-fit against model complexity (penalising overfitting)

We continue removing terms with high p-values until we reach the most parsimonious model.

The anova shows it is ok to remove the term.

model2 <- gam(
  proportion_anticausative ~ 
    s(causalness_lagged, k=10)+
    s(causalness_degree, k=10)+
    #s(period, k=5)+
    s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1) +
    s(causalness_degree, by = language, k = 10, bs = "cs", m = 1)+
    s(period, by = language, k = 5, bs = "cs", m= 1),   
  family = betar(link = "logit"),                 
  data = join,
  method="ML"
)

# Summarize the model
summary(model2)

Family: Beta regression(0.536) 
Link function: logit 

Formula:
proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(causalness_lagged, by = language, k = 10, bs = "cs", 
    m = 1) + s(causalness_degree, by = language, k = 10, bs = "cs", 
    m = 1) + s(period, by = language, k = 5, bs = "cs", m = 1)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7104     0.1305   5.445 5.19e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                           edf Ref.df Chi.sq  p-value    
s(causalness_lagged)                 1.000e+00      1  1.129 0.288069    
s(causalness_degree)                 1.000e+00      1 11.940 0.000549 ***
s(causalness_lagged):languageItalian 4.501e-06      9  0.000 0.623366    
s(causalness_lagged):languageSpanish 7.298e-05      9  0.000 0.634162    
s(causalness_degree):languageItalian 5.413e-06      9  0.000 0.188334    
s(causalness_degree):languageSpanish 1.322e+00      9  8.454 0.002208 ** 
s(period):languageItalian            5.765e-05      3  0.000 0.514120    
s(period):languageSpanish            1.327e-01      3  0.177 0.244040    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.161   Deviance explained = 26.8%
-ML = -302.32  Scale est. = 1         n = 99
anova(model1, model2, test="Chisq")
Analysis of Deviance Table

Model 1: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(period, k = 5) + s(causalness_lagged, by = language, 
    k = 10, bs = "cs", m = 1) + s(causalness_degree, by = language, 
    k = 10, bs = "cs", m = 1) + s(period, by = language, k = 5, 
    bs = "cs", m = 1)
Model 2: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(causalness_lagged, by = language, k = 10, bs = "cs", 
    m = 1) + s(causalness_degree, by = language, k = 10, bs = "cs", 
    m = 1) + s(period, by = language, k = 5, bs = "cs", m = 1)
  Resid. Df Resid. Dev       Df Deviance Pr(>Chi)
1    92.750    -609.35                           
2    93.361    -609.48 -0.61183  0.12349         
AIC(model1, model2)
             df       AIC
model1 6.774768 -595.8037
model2 6.046560 -597.3836

In model3 we remove s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1), the term with the next highest p-value. The anova shows it is not ok to remove it.

model3 <- gam(
  proportion_anticausative ~ 
    s(causalness_lagged, k=10)+
    s(causalness_degree, k=10)+
    #s(period, k=5)+
    #s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1) +
    s(causalness_degree, by = language, k = 10, bs = "cs", m = 1)+
    s(period, by = language, k = 5, bs = "cs", m= 1),   
  family = betar(link = "logit"),                 
  data = join,
  method="ML"
)

# Summarize the model
summary(model3)

Family: Beta regression(0.536) 
Link function: logit 

Formula:
proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(causalness_degree, by = language, k = 10, bs = "cs", 
    m = 1) + s(period, by = language, k = 5, bs = "cs", m = 1)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7104     0.1305   5.445 5.19e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                           edf Ref.df Chi.sq  p-value    
s(causalness_lagged)                 1.000e+00      1  1.129 0.288085    
s(causalness_degree)                 1.000e+00      1 11.940 0.000549 ***
s(causalness_degree):languageItalian 7.061e-06      9  0.000 0.188330    
s(causalness_degree):languageSpanish 1.322e+00      9  8.455 0.002208 ** 
s(period):languageItalian            1.505e-04      3  0.000 0.514117    
s(period):languageSpanish            1.327e-01      3  0.177 0.244041    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.161   Deviance explained = 26.8%
-ML = -302.32  Scale est. = 1         n = 99
anova(model3, model2, test="Chisq")
Analysis of Deviance Table

Model 1: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(causalness_degree, by = language, k = 10, bs = "cs", 
    m = 1) + s(period, by = language, k = 5, bs = "cs", m = 1)
Model 2: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(causalness_lagged, by = language, k = 10, bs = "cs", 
    m = 1) + s(causalness_degree, by = language, k = 10, bs = "cs", 
    m = 1) + s(period, by = language, k = 5, bs = "cs", m = 1)
  Resid. Df Resid. Dev          Df   Deviance  Pr(>Chi)    
1    93.361    -609.48                                     
2    93.361    -609.48 -0.00015192 -9.173e-05 0.0007147 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model3, model2) 
             df       AIC
model3 6.046661 -597.3835
model2 6.046560 -597.3836

We go back to model2, and in model4 we remove s(period, by = language, k = 5, bs = "cs", m = 1), which has the next highest p-value. The comparison between model2 and model4 shows it is ok to remove this term.

model4 <- gam(
  proportion_anticausative ~ 
    s(causalness_lagged, k=10)+
    s(causalness_degree, k=10)+
    #s(period, k=5)+
    s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1) +
    s(causalness_degree, by = language, k = 10, bs = "cs", m = 1),
    #s(period, by = language, k = 5, bs = "cs", m= 1),   
  family = betar(link = "logit"),                 
  data = join,
  method="ML"
)
# Summarize the model
summary(model4)

Family: Beta regression(0.535) 
Link function: logit 

Formula:
proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(causalness_lagged, by = language, k = 10, bs = "cs", 
    m = 1) + s(causalness_degree, by = language, k = 10, bs = "cs", 
    m = 1)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7114     0.1304   5.456 4.86e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                           edf Ref.df Chi.sq p-value   
s(causalness_lagged)                 1.000e+00      1  1.177 0.27800   
s(causalness_degree)                 1.000e+00      1  0.004 0.94653   
s(causalness_lagged):languageItalian 2.142e-05      9  0.000 0.59328   
s(causalness_lagged):languageSpanish 1.763e-05      9  0.000 0.60515   
s(causalness_degree):languageItalian 1.345e+00      9  8.353 0.00243 **
s(causalness_degree):languageSpanish 1.203e-05      9  0.000 0.14713   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.162   Deviance explained = 26.4%
-ML = -302.15  Scale est. = 1         n = 99
anova(model2, model4, test="Chisq")
Analysis of Deviance Table

Model 1: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(causalness_lagged, by = language, k = 10, bs = "cs", 
    m = 1) + s(causalness_degree, by = language, k = 10, bs = "cs", 
    m = 1) + s(period, by = language, k = 5, bs = "cs", m = 1)
Model 2: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(causalness_lagged, by = language, k = 10, bs = "cs", 
    m = 1) + s(causalness_degree, by = language, k = 10, bs = "cs", 
    m = 1)
  Resid. Df Resid. Dev       Df Deviance Pr(>Chi)
1    93.361    -609.48                           
2    93.674    -609.11 -0.31242   -0.371   0.1941
AIC(model2, model4)
             df       AIC
model2 6.046560 -597.3836
model4 5.835506 -597.4347

In model5 we remove s(causalness_degree, k=10), which is no longer significant in model4. The anova shows that removing it does not affect the model fit.

model5 <- gam(
  proportion_anticausative ~ 
    s(causalness_lagged, k=10)+
    #s(causalness_degree, k=10)+
    #s(period, k=5)+
    s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1) +
    s(causalness_degree, by = language, k = 10, bs = "cs", m = 1),
    #s(period, by = language, k = 5, bs = "cs", m= 1),   
  family = betar(link = "logit"),                 
  data = join,
  method="ML"
)
# Summarize the model
summary(model5) 

Family: Beta regression(0.535) 
Link function: logit 

Formula:
proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_lagged, 
    by = language, k = 10, bs = "cs", m = 1) + s(causalness_degree, 
    by = language, k = 10, bs = "cs", m = 1)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7121     0.1300   5.476 4.34e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                           edf Ref.df Chi.sq  p-value    
s(causalness_lagged)                 1.000e+00      1  1.507 0.219630    
s(causalness_lagged):languageItalian 6.165e-05      9  0.000 0.614137    
s(causalness_lagged):languageSpanish 4.849e-05      9  0.000 0.623972    
s(causalness_degree):languageItalian 1.401e+00      9 13.356 0.000165 ***
s(causalness_degree):languageSpanish 4.976e-05      9  0.000 0.673066    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.171   Deviance explained = 26.4%
-ML = -302.15  Scale est. = 1         n = 99
anova(model4, model5, test="Chisq")
Analysis of Deviance Table

Model 1: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    k = 10) + s(causalness_lagged, by = language, k = 10, bs = "cs", 
    m = 1) + s(causalness_degree, by = language, k = 10, bs = "cs", 
    m = 1)
Model 2: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_lagged, 
    by = language, k = 10, bs = "cs", m = 1) + s(causalness_degree, 
    by = language, k = 10, bs = "cs", m = 1)
  Resid. Df Resid. Dev     Df  Deviance Pr(>Chi)
1    93.674    -609.11                          
2    94.713    -609.04 -1.039 -0.066054   0.8105
AIC(model4, model5) 
             df       AIC
model4 5.835506 -597.4347
model5 4.844266 -599.3511

The next candidate for removal is s(causalness_lagged), tested in model6 (we have already tried removing s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1)). Removing it is not clearly warranted: the anova comparison is marginally significant and the AIC difference between the two models is negligible.

So we go back to model5, and this time in model7 we remove s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1). Removing this group-wise smooth is not ok either, so model5 is our final model.

model6 <- gam(
  proportion_anticausative ~ 
    #s(causalness_lagged, k=10)+
    #s(causalness_degree, k=10)+
    #s(period, k=5)+
    s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1) +
    s(causalness_degree, by = language, k = 10, bs = "cs", m = 1),
    #s(period, by = language, k = 5, bs = "cs", m= 1),   
  family = betar(link = "logit"),                 
  data = join,
  method="ML"
)
# Summarize the model
summary(model6) #

Family: Beta regression(0.53) 
Link function: logit 

Formula:
proportion_anticausative ~ s(causalness_lagged, by = language, 
    k = 10, bs = "cs", m = 1) + s(causalness_degree, by = language, 
    k = 10, bs = "cs", m = 1)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7051     0.1303   5.411 6.26e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                           edf Ref.df Chi.sq  p-value    
s(causalness_lagged):languageItalian 3.147e-01      9  0.476    0.199    
s(causalness_lagged):languageSpanish 1.549e-04      9  0.000    0.553    
s(causalness_degree):languageItalian 1.464e+00      9 16.542 1.06e-05 ***
s(causalness_degree):languageSpanish 5.403e-05      9  0.000    0.946    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.148   Deviance explained = 25.7%
-ML = -301.38  Scale est. = 1         n = 99
anova(model6, model5, test="Chisq")
Analysis of Deviance Table

Model 1: proportion_anticausative ~ s(causalness_lagged, by = language, 
    k = 10, bs = "cs", m = 1) + s(causalness_degree, by = language, 
    k = 10, bs = "cs", m = 1)
Model 2: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_lagged, 
    by = language, k = 10, bs = "cs", m = 1) + s(causalness_degree, 
    by = language, k = 10, bs = "cs", m = 1)
  Resid. Df Resid. Dev      Df Deviance Pr(>Chi)  
1    94.929    -608.43                            
2    94.713    -609.04 0.21632  0.60493  0.09856 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model6, model5) 
             df       AIC
model6 4.424766 -599.5852
model5 4.844266 -599.3511
model7 <- gam(
  proportion_anticausative ~ 
    s(causalness_lagged, k=10)+
    #s(causalness_degree, k=10)+
    #s(period, k=5)+
    #s(causalness_lagged, by = language, k = 10, bs = "cs", m = 1) +
    s(causalness_degree, by = language, k = 10, bs = "cs", m = 1),
    #s(period, by = language, k = 5, bs = "cs", m= 1),   
  family = betar(link = "logit"),                 
  data = join,
  method="ML"
)
# Summarize the model
summary(model7)

Family: Beta regression(0.535) 
Link function: logit 

Formula:
proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    by = language, k = 10, bs = "cs", m = 1)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7121     0.1300   5.476 4.34e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                           edf Ref.df Chi.sq  p-value    
s(causalness_lagged)                 1.000e+00      1  1.507 0.219621    
s(causalness_degree):languageItalian 1.401e+00      9 13.356 0.000165 ***
s(causalness_degree):languageSpanish 8.654e-05      9  0.000 0.673050    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.171   Deviance explained = 26.4%
-ML = -302.15  Scale est. = 1         n = 99
anova(model5, model7, test="Chisq")
Analysis of Deviance Table

Model 1: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_lagged, 
    by = language, k = 10, bs = "cs", m = 1) + s(causalness_degree, 
    by = language, k = 10, bs = "cs", m = 1)
Model 2: proportion_anticausative ~ s(causalness_lagged, k = 10) + s(causalness_degree, 
    by = language, k = 10, bs = "cs", m = 1)
  Resid. Df Resid. Dev         Df    Deviance  Pr(>Chi)    
1    94.713    -609.04                                     
2    94.713    -609.04 -8.287e-05 -1.1564e-05 0.0004757 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model7, model5) 
             df       AIC
model7 4.844210 -599.3512
model5 4.844266 -599.3511
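
As a convenience, the AIC values and ML scores of all candidate models in this selection sequence can be collected in one place. This is a minimal sketch, assuming model4–model7 are all still in the workspace; for models fitted with method = "ML", mgcv stores the negative log marginal likelihood in the gcv.ubre component.

# Collect AIC values for all candidate models at once
AIC(model4, model5, model6, model7)

# Collect the corresponding -ML scores
sapply(list(model4 = model4, model5 = model5, model6 = model6, model7 = model7),
       function(m) m$gcv.ubre)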

Diagnostics of model5

# --- 1. Basic GAM checks (smoothness, k-index, residuals) ---
gam.check(model5)  


Method: ML   Optimizer: outer newton
full convergence after 10 iterations.
Gradient range [-5.252597e-05,-1.740295e-06]
(score -302.1485 & scale 1).
Hessian positive definite, eigenvalue range [1.740291e-06,67.0835].
Model rank =  46 / 46 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                                           k'      edf k-index p-value
s(causalness_lagged)                 9.00e+00 1.00e+00    0.94    0.25
s(causalness_lagged):languageItalian 9.00e+00 6.16e-05    0.94    0.17
s(causalness_lagged):languageSpanish 9.00e+00 4.85e-05    0.94    0.26
s(causalness_degree):languageItalian 9.00e+00 1.40e+00    1.03    0.60
s(causalness_degree):languageSpanish 9.00e+00 4.98e-05    1.03    0.56
# --- 2. Concurvity (nonlinear collinearity among smooths) ---
concurvity(model5, full = TRUE)
              para s(causalness_lagged) s(causalness_lagged):languageItalian
worst    0.4633079            1.0000000                            1.0000000
observed 0.4633079            1.0000000                            1.0000000
estimate 0.4633079            0.9986535                            0.9193622
         s(causalness_lagged):languageSpanish
worst                               1.0000000
observed                            1.0000000
estimate                            0.9536625
         s(causalness_degree):languageItalian
worst                               0.7057063
observed                            0.5251884
estimate                            0.2893153
         s(causalness_degree):languageSpanish
worst                               0.6903573
observed                            0.5590216
estimate                            0.3517013

The QQ plot of residuals (from gam.check) showed that the observed residuals closely followed the theoretical quantiles, with only minor deviations in the tails, suggesting that the distributional assumptions of the beta regression model were adequately met.

The residuals–fitted plot showed that most observations clustered around mid-range fitted values (≈0.5), with residuals centered near zero and no clear systematic trends, indicating that the mean–variance structure of the beta GAM was generally appropriate. A descending line of points between fitted values of roughly 0–2, running from residuals > 1 down toward 0, suggests that a small subset of observations exhibits a systematic pattern, likely due to extreme fitted values, influential cases, or responses near the boundaries of the beta distribution. This pattern is localized rather than global, and the remainder of the residual cloud displays no indication of model misspecification.

The histogram of residuals indicates that most residuals are tightly clustered around zero, with only a small number of larger positive or negative residuals. Overall, the distribution appears approximately symmetric with light tails, suggesting no major problems with the residual structure or the appropriateness of the beta regression model.
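
The individual diagnostic panels discussed above can also be reproduced outside of gam.check(), which is convenient when exporting single figures. A minimal sketch using standard mgcv tools (deviance residuals, the default residual type for this model):

# QQ plot with simulated reference quantiles
qq.gam(model5, rep = 100)

# Residuals against fitted values
plot(fitted(model5), residuals(model5, type = "deviance"),
     xlab = "Fitted values", ylab = "Deviance residuals")
abline(h = 0, lty = 2)

# Histogram of residuals
hist(residuals(model5, type = "deviance"), main = "", xlab = "Deviance residuals")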

Basis dimension checks (k-index = 0.94–1.03, all p ≥ 0.17) indicated that the chosen basis dimension (k = 10) was adequate, with no evidence of undersmoothing.
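
A further way to verify that k = 10 is sufficient is to refit the final model with a larger basis dimension and check that the fit barely changes. A sketch, under the assumption that the covariates have enough unique values to support the larger basis:

# Refit model5 with a doubled basis dimension
model5_k20 <- gam(
  proportion_anticausative ~
    s(causalness_lagged, k = 20) +
    s(causalness_lagged, by = language, k = 20, bs = "cs", m = 1) +
    s(causalness_degree, by = language, k = 20, bs = "cs", m = 1),
  family = betar(link = "logit"),
  data = join,
  method = "ML"
)

# If k = 10 was adequate, AIC and edf values should be nearly unchanged
AIC(model5, model5_k20)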

Concurvity checks indicated very high redundancy between the main smooth of causalness_lagged and its by-language interactions (concurvity ≈ 0.92–1.0). This suggests that the shared and language-specific smooths are not easily separable, likely because the language-specific effects were negligible. Concurvity among the causalness_degree smooths was low to moderate (≤ 0.56 observed), and therefore not considered problematic.
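
The full = TRUE call above reports, for each smooth, its concurvity with all other model terms combined. To pinpoint which specific pairs of smooths overlap, the pairwise breakdown can be inspected as well; a short sketch:

# Pairwise concurvity: a list of three matrices (worst / observed / estimate)
# with one row and column per model term
cc <- concurvity(model5, full = FALSE)
round(cc$worst, 2)
round(cc$estimate, 2)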

Visualising the results of model5

We visualise the effects of causalness_lagged, of causalness_lagged by language, and of causalness_degree by language from model5 (assigned to model below).

When plotting the group-wise smooth of causalness_lagged by language, we see that the Italian and Spanish curves essentially overlap. It may seem puzzling that model selection suggested keeping this term, since there is clearly no effect. We take this to mean that the real effect is that of causalness_lagged itself, which captures everything important, while the group-specific smooths are essentially flat and add no meaningful information. Statistically, the likelihood test picks up a minuscule difference, but for interpretation and visualisation it is safe to consider them redundant.
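
The flatness of the group-specific smooths can also be read directly off the fitted model: with the shrinkage basis bs = "cs", terms that contribute nothing are penalised towards zero effective degrees of freedom. A quick sketch, using gratia (loaded in the setup chunk):

# Effective degrees of freedom per smooth; values near 0 indicate terms
# shrunk out of the model by the "cs" penalty
summary(model5)$s.table[, "edf"]

# Plot all estimated smooths for visual confirmation
gratia::draw(model5)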

model <- model5


# ---- 1. Prediction grid for causalness_lagged (shared smooth) ----
newdata_lagged <- tibble(
  causalness_lagged = seq(min(join$causalness_lagged), max(join$causalness_lagged), length.out = 200),
  causalness_degree = mean(join$causalness_degree, na.rm = TRUE),
  language = factor("Italian", levels = levels(join$language))  # shared smooth
)

pred_lagged <- predict(model, newdata_lagged, type = "link", se.fit = TRUE)

newdata_lagged <- newdata_lagged %>%
  mutate(
    fit_link  = pred_lagged$fit,
    se_link   = pred_lagged$se.fit,
    fit_resp  = plogis(fit_link),
    lower_resp = plogis(fit_link - 2 * se_link),
    upper_resp = plogis(fit_link + 2 * se_link)
  )

p_lagged <- ggplot(newdata_lagged, aes(x = causalness_lagged)) +
  geom_line(aes(y = fit_resp), color = "#fc8961", linewidth = 1.2) +
  geom_ribbon(aes(ymin = lower_resp, ymax = upper_resp), fill = "#fc8961", alpha = 0.2) +
  labs(x = "Causalness Lagged", y = "Proportion of Anticausative") +
  ylim(0, 1) +
  theme_minimal()

# ---- 2. Prediction grid for causalness_degree by language (interaction) ----
newdata_degree <- expand.grid(
  causalness_degree = seq(min(join$causalness_degree, na.rm = TRUE), max(join$causalness_degree, na.rm = TRUE), length.out = 200),
  language = levels(join$language)
) %>%
  as_tibble() %>%
  mutate(
    causalness_lagged = mean(join$causalness_lagged, na.rm = TRUE),
    language = factor(language, levels = levels(join$language))
  )

pred_degree <- predict(model, newdata_degree, type = "link", se.fit = TRUE)

newdata_degree <- newdata_degree %>%
  mutate(
    fit_link   = pred_degree$fit,
    se_link    = pred_degree$se.fit,
    fit_resp   = plogis(fit_link),
    lower_resp = plogis(fit_link - 2 * se_link),
    upper_resp = plogis(fit_link + 2 * se_link)
  )

p_degree_int <- ggplot(newdata_degree, aes(x = causalness_degree, y = fit_resp, linetype = language)) +
  geom_line(color = "#fc8961", linewidth = 1.2) +
  geom_ribbon(aes(ymin = lower_resp, ymax = upper_resp), fill = "#fc8961", alpha = 0.2, color = NA) +
  labs(x = "Causalness Degree", y = "Proportion of Anticausative", linetype = "Language") +
  ylim(0, 1) +
  theme_minimal() +
  theme(legend.position = "bottom")



# ---- 3. Prediction grid for causalness_lagged by language (interaction) ----
newdata_lagged_int <- expand.grid(
  causalness_lagged = seq(min(join$causalness_lagged), max(join$causalness_lagged), length.out = 200),
  language = levels(join$language)
) %>%
  as_tibble() %>%
  mutate(
    causalness_degree = mean(join$causalness_degree, na.rm = TRUE),
    language = factor(language, levels = levels(join$language))
  )

pred_lagged_int <- predict(model, newdata_lagged_int, type = "link", se.fit = TRUE)

newdata_lagged_int <- newdata_lagged_int %>%
  mutate(
    fit_link   = pred_lagged_int$fit,
    se_link    = pred_lagged_int$se.fit,
    fit_resp   = plogis(fit_link),
    lower_resp = plogis(fit_link - 2 * se_link),
    upper_resp = plogis(fit_link + 2 * se_link)
  )

p_lagged_int <- ggplot(newdata_lagged_int, aes(x = causalness_lagged, y = fit_resp, linetype = language)) +
  geom_line(color = "#fc8961", linewidth = 1.2) +
  geom_ribbon(aes(ymin = lower_resp, ymax = upper_resp), fill = "#fc8961", alpha = 0.2, color = NA) +
  labs(
    x = "Causalness Lagged",
    y = "Proportion of Anticausative",
    linetype = "Language"
  ) +
  ylim(0, 1) +
  theme_minimal() +
  theme(legend.position = "bottom")
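
A note on the confidence bands: fits and standard errors are computed on the link (logit) scale and only then back-transformed with plogis(), so that the ±2 SE ribbons stay within [0, 1] and are appropriately asymmetric near the boundaries. Equivalently, the inverse link function stored in the fitted model could be used; a sketch (for the logit link this is identical to plogis):

# The inverse link stored in the model family
ilink <- family(model)$linkinv
all.equal(ilink(newdata_lagged$fit_link), newdata_lagged$fit_resp)  # TRUE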


p_lagged

p_lagged_int

p_degree_int
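
If a single combined figure is preferred, the three panels can be arranged in one layout with gridExtra, which is already loaded in the setup chunk; a minimal sketch:

gridExtra::grid.arrange(p_lagged, p_lagged_int, p_degree_int, ncol = 3)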