diff --git a/Advance analytics course/Appendix-4-OrtizPerezFinalProject.qmd b/Advance analytics course/Appendix-4-OrtizPerezFinalProject.qmd new file mode 100644 index 0000000..2573e5d --- /dev/null +++ b/Advance analytics course/Appendix-4-OrtizPerezFinalProject.qmd @@ -0,0 +1,753 @@ +# Appendix 4 + +### Santiago Ortiz Pérez + +Advance analytics course + +## Exploratory Analysis of Multidimensional Data {.unnumbered} + +For this project I choosed to analyze the department *Norte De Santander* which department code is 54. +We load the whole crime dataset and filter by the department code to choose the data related to our department of interest *Norte De Santander*. + +```{r} +#| echo: true +#| message: false +#| warning: false + +source('setup.R') + +delitos_data = delitos_data[delitos_data$dpto_ccdgo == '54', ] + +dim(delitos_data) +``` + +The database show us 19754 entries on our department. Each row represent one "manzana censal" defined by DANE as the smallest geographic units used for collecting and organizing census data in Colombia. A census block typically represents a city block or a small group of neighboring dwellings in an urban area, and is bounded by streets or other clearly defined physical features. + +For this project, I have chosen to study terrorism, which is labeled as "TR" in the dataset. The data spans from 2022 to 2025 and provides valuable insights into the patterns and frequency of terrorist activity during this period. This type of crime is particularly significant in a specific region of the country that has been known to harbor terrorist groups attempting to gain control over the area. + +```{r} +# Select the relevant columns from the data and drop geometry +data_imp = delitos_data %>% + st_drop_geometry() %>% + select(sum_22TR, sum_23TR, sum_24TR, sum_25TR) +``` + +### Summary + +```{r} +summary(data_imp) +``` + +- *Extremely Low Frequency*: Across all years (2022 to 2025), the median and 3rd quartile values are 0, indicating that most regions or observations recorded no terrorist activity. + +- *Slight Peak in 2022*: The mean value in 2022 (0.002177) is slightly higher than in other years, and the maximum value is 3, suggesting a small concentration of terrorist events in a few specific areas that year. + +- *Skewed Distribution*: The data are heavily skewed, with nearly all values at 0 and only a few non-zero entries, indicating that terrorism incidents are rare and isolated. + +### Boxplot + +```{r} +boxplot(data_imp, + main = "Boxplot de Terrorimo por año", + xlab = "Año", + ylab = "Valor", + col = "lightblue") +``` + +The boxplots for the terrorism data confirm a highly skewed distribution, with most values concentrated at zero and very few outliers representing terrorist activity. We can notice that most observations fall at or near zero, reinforcing that terrorist incidents are rare across the dataset.Some outliers are visible in some years (especially 2022), highlighting isolated regions with higher levels of activity. The interquartile range is flat (Q1 = Q3 = 0) in all years, which means there is no spread among the middle 50% of the data—they all report zero incidents. + + +```{r} +# Contar cuántas entradas son exactamente 0 por año +conteo_ceros = sapply(data_imp, function(x) sum(x != 0, na.rm = TRUE)) + +# Mostrar resultados +print(conteo_ceros) +``` + +We observe that the crime of terrorism occurs in at most 37 census blocks out of a total of 19,754 available in the region. This is confirmed by checking the records with non-zero values in the dataset. + +This finding highlights the low frequency and high geographic concentration of the crime, indicating that only a small portion of the territory has been affected. + +The skewness of the data (right-skewed distribution) will be further examined in later analyses using histograms and dispersion measures. + +To conduct a deeper analysis, we select the year 2022 as the reference point, as it is the year with the highest number of reported incidents. This year will serve as the focus for understanding the dynamics of the crime at its peak. + +### Map of the Crime Data + +The map displays the spatial distribution of reported terrorism crimes, focusing exclusively on municipalities where at least one incident was recorded (non-zero values). By highlighting only these non-zero cases in red, the map avoids visual clutter and directs attention to the specific areas where the crime has actually occurred. + +```{r} +# Filter only municipalities where the crime occurred (sum_22TR > 0) +nonzero_munis <- delitos_data[delitos_data$sum_22TR > 0, ] + +# Create the Leaflet map showing only these municipalities in red +leaflet(nonzero_munis) %>% + addTiles() %>% + addPolygons( + color = "black", weight = 1, opacity = 0.5, + fillColor = "red", fillOpacity = 0.7, + popup = ~paste("Crime Rate:", round(sum_22TR, 2)) + ) %>% + addLegend(colors = "red", labels = "Municipalities with Crime", + title = "Terrorism (2022)", position = "bottomright") + +``` + +From the visualization, it becomes evident that terrorism incidents are highly concentrated and not widespread. Municipalities such as Tibú, El Tarra, Teorama, and San Calixto—located in Norte de Santander—stand out as persistent hotspots. This pattern suggests that terrorism in this region is localized, possibly linked to ongoing conflict dynamics or the presence of illegal armed groups. The key takeaway is that targeted interventions in these few municipalities could have a significant impact on reducing terrorism-related risks, making the map a valuable tool for prioritizing policy responses and further investigation. + +### Persistent over years + +```{r} +# Filtrar municipios con sum_22TR > 0 +munis_22TR <- delitos_data[delitos_data$sum_22TR > 0, "mpio_cdpmp"] %>% st_drop_geometry() + +# Filtrar municipios con sum_23TR > 0 +munis_23TR <- delitos_data[delitos_data$sum_23TR > 0, "mpio_cdpmp"]%>% st_drop_geometry() + +# Filtrar municipios con sum_24TR > 0 +munis_24TR <- delitos_data[delitos_data$sum_24TR > 0, "mpio_cdpmp"]%>% st_drop_geometry() + +# Encontrar municipios que se repiten en los tres periodos +munis_repetidos <- Reduce(intersect, list(munis_22TR$mpio_cdpmp, munis_23TR$mpio_cdpmp, munis_24TR$mpio_cdpmp)) + +# Mostrar municipios repetidos +print(munis_repetidos) + +``` + +Cúcuta, Tibú, and Teorama, consistently report incidents of terrorism each year. This persistent presence indicates that these areas remain hotspots for terrorist activity, highlighting the need for focused security measures and targeted interventions to address and mitigate the underlying causes of terrorism in these regions. + + +### Skewness + +```{r} +skewness(delitos_data$sum_22TR, na.rm = TRUE) + +# skewness +delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + summarise(across(everything(), ~ skewness(.x, na.rm = TRUE))) %>% + t() %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "Crime Type") %>% + mutate(V1 = round(V1, 2)) %>% + rename(Skewness = V1) %>% + gt() +``` + +All crime types exhibit strong positive skewness, indicating that most municipalities report low or zero counts, while a few municipalities experience very high crime counts. + +The high skewness (29.67) confirms that terrorist crimes are concentrated in very few locations. + +## Kurtosis + +```{r} +kurtosis(delitos_data$sum_22TR, na.rm = TRUE) +``` + +```{r} +# Kurtosis +delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + summarise(across(everything(), ~ kurtosis(.x, na.rm = TRUE))) %>% + t() %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "Crime Type") %>% + mutate(V1 = round(V1, 2)) %>% + rename(Kurtosis = V1) %>% + gt() +``` + +All crime types show very high kurtosis, indicating distributions heavily dominated by extreme outliers, creating a highly peaked and heavy-tailed distribution. + +For terrorism crimes, a kurtosis above 1000 further confirms that the data is driven by extreme cases in a few locations, which could be key targets for intervention. + +## Coefficient of Variation + + +```{r} +# variation +delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + summarise( + across( + everything(), + ~ ifelse(mean(.x, na.rm = TRUE) != 0, + sd(.x, na.rm = TRUE) / mean(.x, na.rm = TRUE), + NA), # Compute CV safely + .names = "{col}" + ) + ) %>% + t() %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "Crime Type") %>% + mutate(V1 = round(V1, 2)) %>% + rename(Variation = V1) %>% + gt() +``` + +Most crime types show low to moderate variation (CV between 5 and 18), meaning the relative spread of crime counts is somewhat consistent across locations. + +Terrorism crimes have a high CV (24.66), confirming that terrorism incidents are clustered heavily in certain municipalities, rather than spread evenly. + + +## Covariance Matrix + +```{r} +#| echo: true +#| message: false +#| warning: false +delitos_data %>% + st_drop_geometry() %>% + select(contains("22")) %>% + cov() %>% + round(2) %>% + knitr::kable(digits = 2, caption = "Covariance Matrix") +``` + + +The covariances between TR and the rest of the crime variables were consistently at or near zero, indicating no significant linear correlation with other types of crimes in the dataset. + +This result reflects the nature of terrorism in Norte de Santander, where armed groups frequently target physical infrastructure such as oil pipelines, electrical towers, bridges, and military checkpoints. These acts often take place in intermunicipal roads or remote rural areas, far from populated zones. As such, they differ in location and motivation from crimes like theft, sexual assault, kidnapping, homicide, and domestic violence, which are typically reported in residential neighborhoods. + +Therefore, the lack of correlation is not only a reflection of differing criminal dynamics but also a limitation of the spatial granularity of the data, which may underrepresent terrorism incidents that fall outside typical civilian reporting zones. This highlights the need to consider alternative spatial units or methodologies when analyzing terrorism-related data. + +## Redundant Variables + +Redundant variables provide little additional information due to high correlation with others, leading to multicollinearity in models. + +### Redundant Variables Detection + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Covariance matrix +cm_delitos_data <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + select(-sum_22TR) %>% + #select(-sum_24SE) %>% + #select(-sum_24SS) %>% + cov() + +# Compute eigenvalues and eigenvectors +eigen_results <- cm_delitos_data %>% eigen() + +# Extract eigenvalues and eigenvectors +eigenvalues <- eigen_results$values +eigenvectors <- eigen_results$vectors + +# Display eigenvalues and eigenvectors +print(eigenvalues) +head(eigenvectors) + +# The Smallest Eigenvalues +sort(eigenvalues, decreasing = FALSE) + +# The smallest eigenvalue is approximately zero +smallest_eigenvalue <- min(eigenvalues) +print(smallest_eigenvalue) + +# Corresponding eigenvector +smallest_eigenvector <- eigenvectors[, which.min(eigenvalues)] +print(smallest_eigenvector) + +# Normalize the eigenvector by dividing by the largest absolute value +normalized_eigenvector <- smallest_eigenvector / max(abs(smallest_eigenvector)) +print(normalized_eigenvector) + +# Sorted normalize the eigenvector +sort(abs(normalized_eigenvector), decreasing = T) + +# Get numeric variable names (order matches eigenvector indices) +variable_names <- colnames(cm_delitos_data) + +# Sort normalized eigenvector by absolute contribution (descending order) +sorted_contributions <- sort(abs(normalized_eigenvector), decreasing = TRUE) + +# Get the indices of the top contributions +top_indices <- order(abs(normalized_eigenvector), decreasing = TRUE) + +# Get the names of the top variables +top_variable_names <- variable_names[top_indices] + +# Print the top variable names +print(top_variable_names) +``` + +Based on the eigenvector analysis of the covariance matrix, we identified that the crime variable sum_22SS has the highest absolute contribution to the eigenvector associated with the smallest eigenvalue. This indicates that sum_22SS lies along a direction in the data space with very low variance, meaning it contributes minimal unique information relative to the other crime variables. + +### Regression Analysis + +The regression model returns all coefficients, standard errors, and test statistics as zero or NaN (Not a Number). This indicates that the dependent variable $( \text{sum\_24TR} )$ is likely a linear combination of the independent variables or identical to their sum. Mathematically, if: + +$\text{sum\_24TR} = \sum_{i=1}^{n} \beta_i \cdot X_i$ + +where $( X_i )$ represents the independent variables, then the design matrix $( X )$ is rank deficient (i.e., singular), making it impossible to estimate unique regression coefficients. This leads to zero residuals and undefined test statistics. A possible cause is perfect multicollinearity, meaning the predictor variables are linearly dependent. In such cases, the system has no unique solution, and the regression fails to provide meaningful estimates. + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Fit a regression model to confirm the relationship + +model <- lm(sum_22TR ~ sum_22HA + sum_22EX + sum_22HM + + sum_22LP + sum_22HR + sum_22VI + + sum_22HC + sum_22HP + sum_22DS + sum_22SS + sum_22SE + sum_22HOM, + data = data.frame(delitos_data)) + +summary(model) +``` + +The linear regression results indicate that the coefficients for sum_22HM, sum_22LP, sum_22VI, sum_22HC, and sum_22HOM are statistically significant predictors of sum_22TR. This suggests that these variables have a meaningful relationship with terrorism incidents in the dataset. + +## Global Variability Metric + +The effective variance and effective standard deviation are measures of the overall variability in the dataset. They are derived from the determinant of the covariance matrix, which captures the generalized variance of the data. For log-transformed data, these metrics are computed similarly but on the log-transformed covariance matrix. + +The effective variance is defined as: + +Effective Variance $= \det(\Sigma)^{\frac{1}{p}}$ + +where: + +- $( \Sigma )$ is the covariance matrix. +- $( p )$ is the number of variables. + +The effective standard deviation is given by: + +- Effective Standard Deviation $= \det(\Sigma)^{\frac{1}{2p}}$ + +For log-transformed data, the effective variance is computed as: + +- Log-Transformed Effective Variance $= \det(\log(\Sigma + 1))^{\frac{1}{p}}$ + +Similarly, the log-transformed effective standard deviation is: + +- Log-Transformed Effective Standard Deviation $= \det(\log(\Sigma + 1))^{\frac{1}{2p}}$ + +```{r} +#| echo: true +#| message: false +#| warning: false + +cov_matrix <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + select(-sum_22TR) %>% + cov() + +# Effective Variance +det(cov_matrix)^(1/ncol(cov_matrix)) + +# Log-Transformed Effective Variance +det(log(cov_matrix + 1))^(1/ncol(cov_matrix)) + +# Effective Standard Deviation +det(cov_matrix)^(1/(ncol(cov_matrix) * 2)) + +# Log-Transformed Effective Standard Deviation +det(log(cov_matrix + 1))^(1/(ncol(cov_matrix) * 2)) +``` + +These metrics provide a compact summary of how dispersed the multivariate crime data is across all variables simultaneously. The relatively low effective variance and standard deviation suggest that, when considering all crime variables together (excluding terrorism crimes), the data exhibits modest variability — likely reflecting many zeros or low counts in the dataset. + +## Linear Dependency and Precision Matrix + +Linear dependency in data occurs when some variables can be expressed as linear combinations of others, leading to redundancy. This is identified through the covariance matrix $( \Sigma )$ and its eigenvalues, where a near-zero eigenvalue indicates dependency. + +The precision matrix $( \Sigma^{-1} )$, the inverse of the covariance matrix, quantifies conditional dependencies. It highlights direct variable relationships, with zero entries indicating independence given other variables. These concepts are crucial for multicollinearity detection and improving model interpretability. + +Multicollinearity occurs when predictor variables are highly correlated, making it difficult to isolate their individual effects in a model. + +```{r} +#| echo: true +#| message: false +#| warning: false + +cov_matrix <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + cov() + +# Compute precision matrix +S_inv <- solve(cov_matrix) + +# Display precision matrix (should match example values) +cat("Precision Matrix (S⁻¹):\n") +print(S_inv, digits = 2) + +# Extract correct row components of the selected crime +dependent_variable_index <- 13 + +first_row <- S_inv[dependent_variable_index, ] +print(first_row, digits = 2) + +diag_element <- S_inv[dependent_variable_index, dependent_variable_index] +print(diag_element, digits = 2) + +# Compute regression coefficients +beta_coefficients <- -first_row[-dependent_variable_index] / diag_element +print(beta_coefficients, digits = 2) + +# Compute residual variance +residual_variance <- 1 / diag_element +residual_sd <- sqrt(residual_variance) # Residual standard error + +# Print residual standard error +print(residual_sd, digits = 2) + +# Compute R^2 +r_squared <- 1 - (residual_variance / cov_matrix[dependent_variable_index, dependent_variable_index]) +print(r_squared, digits = 2) + +# Verify with lm() regression +delitos <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + select(-sum_22VI, -sum_22HP, -sum_22DS, -sum_22SE) + +# Fit model +model <- lm(sum_22TR ~ ., data = data.frame(delitos)) +summary(model) +``` + +This global variability measure can help in assessing the overall complexity or “spread” in the data, which is important for further analyses like PCA, clustering, or regression modeling. + +Based on this, it is advisable to keep the variables sum_22HM, sum_22LP, sum_22HC, and sum_22HOM in the model as key explanatory variables for terrorism-related crime. + + +# Hands-on Data Analysis {.unnumbered} + +## Tackling a Critical Challenge: The Proliferation of Spatial Criminal Phenomena + +# Non-Parametric Correlation {.unnumbered} + +Correlation measures the strength and direction of association between two variables. While Pearson's correlation requires a linear relationship and normally distributed data, \emph{Spearman's rank correlation} and \emph{Kendall's tau} are \emph{non-parametric} measures, making them ideal for analyzing data that may not be linear or normally distributed. + +## Spearman's Rank Correlation + +```{r} +#| echo: true +#| message: false +#| warning: false + +#delitos_data <- delitos_data %>% +# select(-sum_22TR) + +delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + cor(., method = "spearman", use = "complete.obs") %>% + round(., 3) %>% + print(.) %>% + corrplot(., method = "color", title = "Spearman Correlation", mar=c(0,0,1,0)) +``` + +Absolute Spearman's $\rho$ values are less than 0.2, this show us low correlation. + +# Spatial Neighborhood Matrices {.unnumbered} + +## Neighbors Based on Contiguity + +- Queen Contiguity: Two polygons are considered neighbors if they share any common point (i.e., an edge or a vertex). Mathematically, if polygons $p_i$ and $p_j$ touch at any point, then $A\_{ij} = 1$. +- Rook Contiguity: Two polygons are neighbors only if they share a common edge. That is, if polygons $p_i$ and $p_j$ share a boundary segment, then $A\_{ij} = 1$; merely touching at a corner does not count. + + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Create a spatial neighbors list using Queen contiguity +# (i.e., polygons are considered neighbors if they share any point: edge or vertex) +nb <- spdep::poly2nb(delitos_data, queen = TRUE) +head(nb) + +# Replace invalid neighbor entries (i.e., [1] 0) with empty integer vectors +# This ensures compatibility with functions that expect valid neighbor lists only +nb_0 <- lapply(nb, function(x) if(length(x)==1 && x==0) integer(0) else x) + +# Polygons with neighbors +table(sapply(nb_0, length)) + +# Neighbors of Order k Based on Contiguity +# Neighbors of second order +nblags <- spdep::nblag(neighbours = nb, maxlag = 2) + +# Combine neighbors of all orders up to the specified lag (in this case, up to order 2) +# This creates a cumulative neighbor list including first- and second-order neighbors +nblagsc <- spdep::nblag_cumul(nblags) +table(sapply(nblagsc, length)) +``` + +## Neighbors Based on k Nearest Neighbors + +- K-Nearest Neighbors: For each polygon, the ( k ) nearest neighbors are identified based on a distance threshold. +- Distance Threshold: The distance threshold can be defined as a fixed value or as a function of the average distance between polygons. + +\textbf{k-Nearest Neighbors (kNN)} is a method that defines neighbors based on distance rather than contiguity. For each spatial unit $p_i$, the $k$ closest units (according to Euclidean distance or other metric) are selected as neighbors. + +Formally, let $D(p_i, p_j)$ be the distance between polygons $p_i$ and $p_j$. Then, the neighbor set $N_k(p_i)$ is defined as: + +$N_k(p_i) = p_j$ : $p_j$ is among the $k$ nearest polygons to $p_i$. + +This ensures that each polygon has exactly $k$ neighbors, which is useful when spatial units are irregular or disconnected. + + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Compute centroids of the polygons +coo <- st_centroid(delitos_data) + +# Create a neighbor list where each polygon (based on its centroid `coo`) is connected +# to its 3 nearest neighbors using k-nearest neighbors (k = 3) +nb <- knn2nb(knearneigh(coo, k = 3)) # k number nearest neighbors + +# Polygons with neighbors +table(sapply(nb, length)) + +# Subset data to the first 10 polygons +delitos_data_10 <- delitos_data[1:15000, ] + +# Recompute neighbor list for these 10 polygons to avoid index mismatches +nb_10 <- knn2nb(knearneigh(st_centroid(delitos_data_10), k = 3)) + +# Compute centroids for the 10 polygons +coords_10 <- st_coordinates(st_centroid(delitos_data_10)) + +# Plot the first 10 polygons and overlay neighbor connections in red +plot(st_geometry(delitos_data_10), border = "lightgray", main = "First Polygons with 3 Nearest Neighbors") +plot.nb(nb_10, coords_10, add = TRUE, col = "red", lwd = 2) +``` + +## Neighbors Based on Distance + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Create a neighbor list using distance-based contiguity: +# Polygons are considered neighbors if their centroids are within 0.4 units (e.g., degrees) apart +nb <- dnearneigh(x = st_centroid(delitos_data), d1 = 0, d2 = 0.4) + +# Polygons with neighbors +hist(sapply(nb, length)) + +# Subset data to the first 10 polygons +delitos_data_10 <- delitos_data[1:15000, ] + +# Recompute neighbor list for these 10 polygons to avoid index mismatches +nb_10 <- dnearneigh(x = st_centroid(delitos_data_10), d1 = 0, d2 = 0.4) + +# Compute centroids for the 10 polygons +coords_10 <- st_coordinates(st_centroid(delitos_data_10)) + +# Plot the first 10 polygons and overlay neighbor connections in red +plot(st_geometry(delitos_data_10), border = "lightgray", main = "First Polygons with 3 Nearest Neighbors") +plot.nb(nb_10, coords_10, add = TRUE, col = "red", lwd = 2) +``` + +Determining an Appropriate Upper Distance Bound: To ensure that each area in a spatial dataset has at least (k) neighbors, we can determine an appropriate upper distance bound by first computing the (k) nearest neighbors for each area. For example, using the Queen contiguity method, one may use the \textit{spdep::knearneigh()} function with (k=1) to obtain the nearest neighbor for each polygon. This yields a matrix of neighbor IDs, which is then converted into a neighbor list (of class \textit{nb}) via \textit{knn2nb()}. Next, the \textit{spdep::nbdists()} function computes the distances along the links between each area and its neighbor. By summarizing these distances (e.g., using \textit{summary(unlist(dist1))}), we can observe the range of distances. + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Compute k-nearest neighbors: for each polygon centroid, find its 1 nearest neighbor (k = 1) +nb1 <- knn2nb(knearneigh(coo, k = 1)) + +# Calculate the Euclidean distances between each polygon and its nearest neighbor +dist1 <- nbdists(nb1, coo) + +# Summarize all distances to understand the minimum, maximum, and quartiles +summary(unlist(dist1)) + +# Create a distance-based neighbor list: polygons whose centroids are within [0, 1.2] units are considered neighbors +nb <- dnearneigh(x = st_centroid(delitos_data), d1 = 0, d2 = 1.2) + +# Polygons with neighbors +hist(sapply(nb, length)) +``` + +## Neighborhood Matrices + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Spatial weights matrix using Queen contiguity (binary weights) +# 'queen = TRUE' considers shared edges OR vertices as neighbors +nb <- poly2nb(delitos_data, queen = TRUE) + +# Convert the neighbor list to a spatial weights list object +# 'style = "W"' row-standardizes the weights (sums to 1) +# 'zero.policy = TRUE' avoids errors when some polygons have no neighbors +nbw <- spdep::nb2listw(nb, style = "W", zero.policy = TRUE) + +# Spatial weights matrix based on inverse distance values +# Compute centroids of polygons +coo <- st_centroid(delitos_data) + +# Use Queen contiguity again to define neighbors +nb <- poly2nb(delitos_data, queen = TRUE) + +# Compute distances between neighbors based on their centroids +dists <- nbdists(nb, coo) + +# Create inverse distance weights (1/distance) for each pair of neighbors +ids <- lapply(dists, function(x){1/x}) + +# Create a listw object using binary style ("B" = no standardization) +nbw <- nb2listw(nb, glist = ids, style = "B", zero.policy = TRUE) +``` + +# Spatial autocorrelation {.unnumbered} + +## Global Moran’s + +Spatial Weights Matrix (W) + +The \textit{nb2listw(nb, style = "W")} function calculates the spatial weights matrix, often denoted as $W$. This matrix defines the spatial relationships between the polygons. The \textit{style = "W"} argument specifies row-standardization. + +- Let $w_{ij}$ be an element of the matrix $W$. +- If polygon $i$ is a neighbor of polygon $j$, then $w_{ij} = \frac{1}{n_i}$, where $n_i$ is the number of neighbors of polygon $i$. +- If polygon $i$ is not a neighbor of polygon $j$, then $w_{ij} = 0$. +- The diagonal elements, $w_{ii}$, are typically 0 (a polygon is not considered a neighbor of itself). + +In essence, each row of the matrix $W$ represents a polygon, and the entries in that row represent the influence of its neighbors. Row-standardization means that the elements in each row sum to 1. + +Global Moran's I Statistic: \textit{moran.test()} function calculates Moran's I, a measure of global spatial autocorrelation. The formula for Moran's I is: + +$I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}) \sum_{i=1}^{n} (x_i - \bar{x})^2}$ + +Where: + +- $n$ is the number of observations (polygons). +- $x_i$ is the value of the variable of interest (in your case, \texttt{delitos\_data\$sum\_22TR}) for polygon $i$. +- $\bar{x}$ is the mean of the variable $x$. +- $w_{ij}$ is the spatial weight between polygon $i$ and polygon $j$ from the matrix $W$. + +Simplified: + +$I = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{S^2 \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}$ + +Where + +$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$ + +Moran's I essentially measures the correlation between the values at a location and the values at neighboring locations. + +Interpretation of Moran's I: + +- $I$ ranges from -1 to +1. +- $I > 0$: Positive spatial autocorrelation. Similar values tend to cluster together. +- $I < 0$: Negative spatial autocorrelation. Dissimilar values tend to cluster together. +- $I \approx 0$: Random spatial pattern. + +Hypothesis Testing: + +The \textit{moran.test()} function also performs a hypothesis test to assess the statistical significance of the observed spatial pattern. + +- \textit{Null Hypothesis ($H_0$):} The variable is randomly distributed in space (no spatial autocorrelation). +- \textif{Alternative Hypothesis ($H_a$):} There is spatial autocorrelation (you specified \textit{alternative = "greater"}, so it's testing for \textit{positive} spatial autocorrelation). + +Output of \textit{moran.test()} + +- \textit{gmoran[["estimate"]][["Moran I statistic"]]}: The calculated value of Moran's I. +- \textit{gmoran[["statistic"]]}: The z-score, which measures how far the observed Moran's I is from the expected value under the null hypothesis, in standard deviations. +- \textit{gmoran[["p.value"]]}: The p-value, which is the probability of observing a Moran's + +`I` value as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true. A small p-value (typically less than 0.05) suggests that you can reject the null hypothesis and conclude that there is statistically significant spatial autocorrelation. + +```{r} +# Compute centroids of the polygons +coo <- st_centroid(delitos_data) + +# Create a neighbor list where each polygon (based on its centroid `coo`) is connected +# to its 3 nearest neighbors using k-nearest neighbors (k = 3) +nb <- knn2nb(knearneigh(coo, k = 3)) # k number nearest neighbors + +# Global Moran's I +# Convert the neighbor list to a listw object +lw <- nb2listw(nb, style = "W") # Use nb2listw + +# Now you can use 'lw' in moran.test +gmoran <- moran.test(delitos_data$sum_22TR, lw, alternative = "greater") + +gmoran +``` + +Terrorism shows a statistically significant tendency to cluster geographically.However, the magnitude of the autocorrelation is small (Moran's I = 0.01), suggesting the clustering effect, while present, is weak. + +The neighbour object has 206 sub-graphs indicating that the spatial neighborhood structure is fragmented — i.e., the network of areas has disconnected clusters (some areas have no mutual neighbors). This might affect the strength and interpretation of spatial relationships. + + +```{r} +gmoran[["estimate"]][["Moran I statistic"]] # Moran's I + +gmoran[["statistic"]] # z-score + +gmoran[["p.value"]] # p-value +``` +Moran's I value: It measures global spatial autocorrelation — how similar or dissimilar nearby values are, compared to a random distribution. Value of 0.0101 shows that there is a very slight positive spatial autocorrelation — areas with similar levels of terrorism incidents are somewhat near each other. + + +Since p < 0.05, the result is statistically significant at the 5% level. This means we can reject the null hypothesis of spatial randomness. There is evidence that terrorism incidents are not randomly distributed, but instead show a mild spatial clustering. + +\textit{Moran's I Monte Carlo Simulation:} The code you provided performs a Monte Carlo simulation to assess the significance of Moran's I. Here's a breakdown: + +Moran's I Monte Carlo Test: The function `moran.mc` performs a Monte Carlo simulation of Moran's I. Instead of relying on the theoretical distribution of Moran's I (which can be complex), it generates a set of random spatial patterns to create an empirical distribution. + +```{r} +#| echo: true +#| message: false +#| warning: false +gmoranMC <- moran.mc(delitos_data$sum_22TR, lw, nsim = 99) +gmoranMC + +hist(gmoranMC$res) +abline(v = gmoranMC$statistic, col = "red") +``` + +```{r} +#| echo: true +#| message: false +#| warning: false + +moran.plot(delitos_data$sum_22TR, lw) +``` + +Monte Carlo simulation confirms our previous results on a true but low positive spatial correlation. + +## Local Moran’s I + +```{r} +#| echo: true +#| message: false +#| warning: false + +lmoran <- localmoran(delitos_data$sum_22TR, lw, alternative = "greater") +head(lmoran) +``` +Ii ≈ 0.00164 → very small positive local autocorrelation. + +Z.Ii ≈ 0.07025 → very low z-scores. + +P-value ≈ 0.472 → not statistically significant. + + + diff --git a/Advance analytics course/Appendix-4.qmd b/Advance analytics course/Appendix-4.qmd index 137002b..64e996d 100644 --- a/Advance analytics course/Appendix-4.qmd +++ b/Advance analytics course/Appendix-4.qmd @@ -1,10 +1,1035 @@ # Appendix 4 {.appendix, .unnumbered} +### Santiago Ortiz Pérez + +Advance analytics course + +## Exploratory Analysis of Multidimensional Data {.unnumbered} + +For this project I choosed to analyze the department *Norte De Santander* which department code is 54. +We load the whole crime dataset and filter by the department code to choose the data related to our department of interest *Norte De Santander*. + +```{r} +#| echo: true +#| message: false +#| warning: false + +source('setup.R') + +delitos_data = delitos_data[delitos_data$dpto_ccdgo == '54', ] + +dim(delitos_data) +``` + +The database show us 19754 entries on our department. Each row represent one "manzanas censales" defined by DANE as the smallest geographic units used for collecting and organizing census data in Colombia. A census block typically represents a city block or a small group of neighboring dwellings in an urban area, and is bounded by streets or other clearly defined physical features. + +For this project, I have chosen to study terrorism, which is labeled as "TR" in the dataset. The data spans from 2022 to 2025 and provides valuable insights into the patterns and frequency of terrorist activity during this period. This type of crime is particularly significant in a specific region of the country that has been known to harbor terrorist groups attempting to gain control over the area. + +```{r} +# Select the relevant columns from the data and drop geometry +data_imp = delitos_data %>% + st_drop_geometry() %>% + select(sum_22TR, sum_23TR, sum_24TR, sum_25TR) +``` + +### Summary + +```{r} +summary(data_imp) +``` + +- *Extremely Low Frequency*: Across all years (2022 to 2025), the median and 3rd quartile values are 0, indicating that most regions or observations recorded no terrorist activity. + +- *Slight Peak in 2022*: The mean value in 2022 (0.002177) is slightly higher than in other years, and the maximum value is 3, suggesting a small concentration of terrorist events in a few specific areas that year. + +- *Skewed Distribution*: The data are heavily skewed, with nearly all values at 0 and only a few non-zero entries, indicating that terrorism incidents are rare and isolated. + +### Boxplot + +```{r} +boxplot(data_imp, + main = "Boxplot de Terrorimo por año", + xlab = "Año", + ylab = "Valor", + col = "lightblue") +``` + +The boxplots for the terrorism data confirm a highly skewed distribution, with most values concentrated at zero and very few outliers representing terrorist activity. We can notice that most observations fall at or near zero, reinforcing that terrorist incidents are rare across the dataset.Some outliers are visible in some years (especially 2022), highlighting isolated regions with higher levels of activity. The interquartile range is flat (Q1 = Q3 = 0) in all years, which means there is no spread among the middle 50% of the data—they all report zero incidents. + + +```{r} +# Contar cuántas entradas son exactamente 0 por año +conteo_ceros = sapply(data_imp, function(x) sum(x != 0, na.rm = TRUE)) + +# Mostrar resultados +print(conteo_ceros) +``` + +We observe that the crime of terrorism occurs in at most 37 census blocks out of a total of 19,754 available in the region. This is confirmed by checking the records with non-zero values in the dataset. + +This finding highlights the low frequency and high geographic concentration of the crime, indicating that only a small portion of the territory has been affected. + +The skewness of the data (right-skewed distribution) will be further examined in later analyses using histograms and dispersion measures. + +To conduct a deeper analysis, we select the year 2022 as the reference point, as it is the year with the highest number of reported incidents. This year will serve as the focus for understanding the dynamics of the crime at its peak. + +### Map of the Crime Data + +The map displays the spatial distribution of reported terrorism crimes, focusing exclusively on municipalities where at least one incident was recorded (non-zero values). By highlighting only these non-zero cases in red, the map avoids visual clutter and directs attention to the specific areas where the crime has actually occurred. + +```{r} +# Filter only municipalities where the crime occurred (sum_22TR > 0) +nonzero_munis <- delitos_data[delitos_data$sum_22TR > 0, ] + +# Create the Leaflet map showing only these municipalities in red +leaflet(nonzero_munis) %>% + addTiles() %>% + addPolygons( + color = "black", weight = 1, opacity = 0.5, + fillColor = "red", fillOpacity = 0.7, + popup = ~paste("Crime Rate:", round(sum_22TR, 2)) + ) %>% + addLegend(colors = "red", labels = "Municipalities with Crime", + title = "Terrorism (2022)", position = "bottomright") + +``` + +From the visualization, it becomes evident that terrorism incidents are highly concentrated and not widespread. Municipalities such as Tibú, El Tarra, Teorama, and San Calixto—located in Norte de Santander—stand out as persistent hotspots. This pattern suggests that terrorism in this region is localized, possibly linked to ongoing conflict dynamics or the presence of illegal armed groups. The key takeaway is that targeted interventions in these few municipalities could have a significant impact on reducing terrorism-related risks, making the map a valuable tool for prioritizing policy responses and further investigation. + +### Persistent over years + +```{r} +# Filtrar municipios con sum_22TR > 0 +munis_22TR <- delitos_data[delitos_data$sum_22TR > 0, "mpio_cdpmp"] %>% st_drop_geometry() + +# Filtrar municipios con sum_23TR > 0 +munis_23TR <- delitos_data[delitos_data$sum_23TR > 0, "mpio_cdpmp"]%>% st_drop_geometry() + +# Filtrar municipios con sum_24TR > 0 +munis_24TR <- delitos_data[delitos_data$sum_24TR > 0, "mpio_cdpmp"]%>% st_drop_geometry() + +# Encontrar municipios que se repiten en los tres periodos +munis_repetidos <- Reduce(intersect, list(munis_22TR$mpio_cdpmp, munis_23TR$mpio_cdpmp, munis_24TR$mpio_cdpmp)) + +# Mostrar municipios repetidos +print(munis_repetidos) + +``` + +Cúcuta, Tibú, and Teorama, consistently report incidents of terrorism each year. This persistent presence indicates that these areas remain hotspots for terrorist activity, highlighting the need for focused security measures and targeted interventions to address and mitigate the underlying causes of terrorism in these regions. + + +### Skewness + +```{r} +skewness(delitos_data$sum_22TR, na.rm = TRUE) + +# skewness +delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + summarise(across(everything(), ~ skewness(.x, na.rm = TRUE))) %>% + t() %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "Crime Type") %>% + mutate(V1 = round(V1, 2)) %>% + rename(Skewness = V1) %>% + gt() +``` + +All crime types exhibit strong positive skewness, indicating that most municipalities report low or zero counts, while a few municipalities experience very high crime counts. + +The high skewness (29.67) confirms that terrorist crimes are concentrated in very few locations. + +## Kurtosis + +```{r} +kurtosis(delitos_data$sum_22TR, na.rm = TRUE) +``` + +```{r} +# Kurtosis +delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + summarise(across(everything(), ~ kurtosis(.x, na.rm = TRUE))) %>% + t() %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "Crime Type") %>% + mutate(V1 = round(V1, 2)) %>% + rename(Kurtosis = V1) %>% + gt() +``` + +All crime types show very high kurtosis, indicating distributions heavily dominated by extreme outliers, creating a highly peaked and heavy-tailed distribution. + +For terrorism crimes, a kurtosis above 1000 further confirms that the data is driven by extreme cases in a few locations, which could be key targets for intervention. + +## Coefficient of Variation + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Compute statistics +mean_val <- mean(delitos_data$sum_22TR, na.rm = TRUE) +print(mean_val) +std_dev <- sd(delitos_data$sum_22TR, na.rm = TRUE) +print(std_dev) + +# Compute the range for first standard deviation +lower_bound <- mean_val - std_dev +upper_bound <- mean_val + std_dev +paste0('lower_bound: ', round(lower_bound, 2), ' - upper_bound: ', round(upper_bound, 2)) + +# Count the number of points within 1 standard deviation +within_1sd <- sum(delitos_data$sum_22TR >= lower_bound & delitos_data$sum_22TR <= upper_bound, na.rm = TRUE) +percentage_1sd <- (within_1sd / nrow(delitos_data)) * 100 +paste0('within_1sd: ', round(within_1sd, 2), ' - percentage_1sd: ', round(percentage_1sd, 2)) + +# Create histogram +ggplot(delitos_data, aes(x = sum_22TR)) + + geom_histogram(binwidth = 5, fill = "blue", alpha = 0.5, color = "black") + + + # Add vertical lines for mean, median, and 1st SD + geom_vline(aes(xintercept = mean_val), color = "red", linetype = "dashed", size = 1.2) + + #geom_vline(aes(xintercept = median_val), color = "green", linetype = "dashed", size = 1.2) + + geom_vline(aes(xintercept = lower_bound), color = "purple", linetype = "dashed", size = 1) + + geom_vline(aes(xintercept = upper_bound), color = "purple", linetype = "dashed", size = 1) + + + # Labels and title + labs(title = "Histogram of DELITOS with Mean, and 1SD Range", + x = "AUTOMOTORES Values", y = "Frequency") + + + # Add annotation for 1SD range + annotate("text", x = mean_val, y = 10, + label = paste(round(percentage_1sd, 1), "1SD", sep = ""), + color = "black", size = 5, hjust = 0.5, vjust = -1) + + + theme_minimal() + +# cv +paste0('cv: ', round(std_dev / mean_val * 100), 2) + +``` + +The data is centered close to zero with most points tightly clustered around the mean (almost 100% within one standard deviation), which suggests low absolute variability. + +However, because the mean is almost zero, the relative variability (CV) appears extremely high and isn’t very meaningful in this context. + + +```{r} +# variation +delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + summarise( + across( + everything(), + ~ ifelse(mean(.x, na.rm = TRUE) != 0, + sd(.x, na.rm = TRUE) / mean(.x, na.rm = TRUE), + NA), # Compute CV safely + .names = "{col}" + ) + ) %>% + t() %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "Crime Type") %>% + mutate(V1 = round(V1, 2)) %>% + rename(Variation = V1) %>% + gt() +``` + +Most crime types show low to moderate variation (CV between 5 and 18), meaning the relative spread of crime counts is somewhat consistent across locations. + +Terrorism crimes have a high CV (24.66), confirming that terrorism incidents are clustered heavily in certain municipalities, rather than spread evenly. + + +## Median Absolute Deviation MAD and MAD/median + +```{r} +# Compute statistics +median_val <- median(delitos_data$sum_22TR, na.rm = TRUE) +print(median_val) +mad_val <- mad(delitos_data$sum_22TR, na.rm = TRUE) # Compute MAD +print(mad_val) + +# Compute the range for first standard deviation +lower_bound <- median_val - mad_val +upper_bound <- median_val + mad_val +paste0('lower_bound: ', round(lower_bound, 2), ' - upper_bound: ', round(upper_bound, 2)) + +# Count the number of points within 1 MAD +within_1mad <- sum(delitos_data$sum_22TR >= lower_bound & delitos_data$sum_22TR <= upper_bound, na.rm = TRUE) +percentage_1mad <- (within_1mad / nrow(delitos_data)) * 100 +paste0('within_1mad: ', round(within_1mad, 2), ' - percentage_1mad: ', round(percentage_1mad, 2)) + +# MAD/Median +paste0('MAD/Median: ', round(mad_val / median_val * 100), 2) +``` + + +Median Absolute Deviation (MAD) and the ratio MAD/median metrics are useful alternatives to standard deviation and coefficient of variation, especially when the data distribution is heavily skewed or contains outliers. + +However, because the crime data exhibits a strong accumulation of zeros the median values are always zero. This accumulation limits the usefulness of median-based measures like MAD and MAD/median, as they do not provide much discriminatory information about variability in this context. + +## Covariance Matrix + +```{r} +#| echo: true +#| message: false +#| warning: false +delitos_data %>% + st_drop_geometry() %>% + select(contains("22")) %>% + cov() %>% + round(2) %>% + knitr::kable(digits = 2, caption = "Covariance Matrix") +``` + +## Covariance Matrix of Log-Transformed Data + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Define the dataset +x <- delitos_data$sum_22TR + +# 1. Compute Raw Arithmetic Mean +arithmetic_mean <- mean(x) +print(arithmetic_mean) + +# 2. Compute Log-Mean (Multiplicative Center) +log_x <- log(x + 1) # Take logarithm of values +head(log_x) +log_mean <- mean(log_x) # Compute mean in log-space +print(log_mean) +log_mean_exp <- exp(log_mean) # Convert back to original scale +print(log_mean_exp) + +# Create the comparison table +comparison_table <- data.frame( + Index = seq_along(x), # Just an index for x-axis + Original_Value = x, + Log_Value = log_x +) + +p1 <- ggplot(comparison_table, aes(x = Original_Value, y = Log_Value)) + + geom_line(color = "gray70", size = 0.7, alpha = 0.5) + # Thin line connecting points + geom_point(alpha = 0.7, color = "blue") + # Scatter points with transparency + labs( + title = "Scatter Plot: Original vs. Log-Transformed Values", + x = "Original Values", + y = "Log-Transformed Values" + ) + + theme_minimal() + +# Add marginal histogram +ggMarginal( + p1, + type = "histogram", # Add marginal histograms + bins = 40, # Number of bins for the histogram + margins = "both", # Add histogram to both x and y margins + size = 5, # Size of the histograms relative to the scatter plot + fill = "gray", # Fill color for the histogram + color = "black", # Outline color for the histogram + alpha = 0.5 # Transparency +) +``` + ```{r} #| include: false -source('setup.R') +# Store values for inline Quarto text +log_values <- paste(round(head(comparison_table$Log_Value), 2), collapse = ", ") +original_values <- paste(head(comparison_table$Original_Value), collapse = ", ") +``` + +Euler steps describe how many multiplicative steps of $( e )$ are needed to reach a given value. + +For example, in our dataset: + +- Original Values: `r original_values`\ +- Log Values: `r log_values` + +Each log-transformed value represents the number of times we need to multiply 1 by $( e )$ to reach the original value: + +$e^ \text{Log Values} = \text{Original Value}$ + +```{r} +#| echo: true +#| message: false +#| warning: false + +#log transformed data +# Compute statistics for raw and log-transformed data +mean_raw <- mean(delitos_data$sum_22TR, na.rm = TRUE) +sd_raw <- sd(delitos_data$sum_22TR, na.rm = TRUE) +mad_raw <- mad(delitos_data$sum_22TR, na.rm = TRUE) + +delitos_data_log <- delitos_data %>% + #mutate(LOG_AUTOMOTORES = log(AUTOMOTORES + 1)) + mutate(LOG_AUTOMOTORES = log1p(sum_22TR)) # log1p(x) = log(1 + x) to handle zeros + +mean_log <- mean(delitos_data_log$LOG_AUTOMOTORES, na.rm = TRUE) +sd_log <- sd(delitos_data_log$LOG_AUTOMOTORES, na.rm = TRUE) +mad_log <- mad(delitos_data_log$LOG_AUTOMOTORES, na.rm = TRUE) + +# Compute statistics for raw and log-transformed data +data.frame( + Measure = c("Mean", "Median", "Standard Deviation", "MAD"), + Raw_Data = c(mean(delitos_data$sum_22TR, na.rm = TRUE), + median(delitos_data$sum_22TR, na.rm = TRUE), + sd(delitos_data$sum_22TR, na.rm = TRUE), + mad(delitos_data$sum_22TR, na.rm = TRUE)), + Log_Transformed_Data = c(mean(delitos_data_log$LOG_AUTOMOTORES, na.rm = TRUE), + median(delitos_data_log$LOG_AUTOMOTORES, na.rm = TRUE), + sd(delitos_data_log$LOG_AUTOMOTORES, na.rm = TRUE), + mad(delitos_data_log$LOG_AUTOMOTORES, na.rm = TRUE))) -# Load the dataset -delitos_data <- st_read("data/spatial/crime_spatial_course.gpkg") -delitos_data <- delitos_data[delitos_data$dpto_ccdgo == '11', ] +# Transform the data to a long format for ggplot +delitos_long <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + pivot_longer(cols = everything(), names_to = "Crime Type", values_to = "Value") + +# Create faceted histograms +ggplot(delitos_long, aes(x = Value)) + + geom_histogram(bins = 30, fill = "skyblue", color = "black", alpha = 0.7) + + facet_wrap(~ `Crime Type`, scales = "free") + # Facet by crime type + theme_minimal() + + labs( + title = "Distributions of Crime Data", + x = "Value", + y = "Frequency" + ) + + theme( + axis.text.x = element_text(size = 5) # Reduce the font size of X-axis text + ) + +# Transform the data to long format and apply log transformation +delitos_long_log <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + mutate(across(everything(), ~ log(.x), .names = "{col}")) %>% # Log transform (log(x + 1) to avoid log(0)) + pivot_longer(cols = everything(), names_to = "Crime Type", values_to = "Log Value") + +# Create faceted histograms for log-transformed values +ggplot(delitos_long_log, aes(x = `Log Value`)) + + geom_histogram(bins = 30, fill = "skyblue", color = "black", alpha = 0.7) + + facet_wrap(~ `Crime Type`, scales = "free") + # Facet by crime type + theme_minimal() + + labs( + title = "Log-Transformed Distributions of Crime Data", + x = "Log Value", + y = "Frequency" + ) + + theme( + axis.text.x = element_text(size = 3) # Reduce the font size of X-axis text + ) + +# Covariance Matrix (Log-Transformed) +delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + mutate(across(everything(), ~ log(.x+1))) %>% # Log-transform (+1 to handle zeros) + cov() %>% + round(2) %>% + kable(digits = 2, caption = "Covariance Matrix (Log-Transformed)") ``` + +We computed both the covariance matrix and the log-transformed covariance matrix to explore the relationships among different crime types. In both analyses, the covariances between TR and the rest of the crime variables were consistently at or near zero, indicating no significant linear correlation with other types of crimes in the dataset. + +This result reflects the nature of terrorism in Norte de Santander, where armed groups frequently target physical infrastructure such as oil pipelines, electrical towers, bridges, and military checkpoints. These acts often take place in intermunicipal roads or remote rural areas, far from populated zones. As such, they differ in location and motivation from crimes like theft, sexual assault, kidnapping, homicide, and domestic violence, which are typically reported in residential neighborhoods. + +Therefore, the lack of correlation is not only a reflection of differing criminal dynamics but also a limitation of the spatial granularity of the data, which may underrepresent terrorism incidents that fall outside typical civilian reporting zones. This highlights the need to consider alternative spatial units or methodologies when analyzing terrorism-related data. + +## Redundant Variables + +Redundant variables provide little additional information due to high correlation with others, leading to multicollinearity in models. + +### Redundant Variables Detection + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Covariance matrix +cm_delitos_data <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + select(-sum_22TR) %>% + #select(-sum_24SE) %>% + #select(-sum_24SS) %>% + cov() + +# Compute eigenvalues and eigenvectors +eigen_results <- cm_delitos_data %>% eigen() + +# Extract eigenvalues and eigenvectors +eigenvalues <- eigen_results$values +eigenvectors <- eigen_results$vectors + +# Display eigenvalues and eigenvectors +print(eigenvalues) +head(eigenvectors) + +# The Smallest Eigenvalues +sort(eigenvalues, decreasing = FALSE) + +# The smallest eigenvalue is approximately zero +smallest_eigenvalue <- min(eigenvalues) +print(smallest_eigenvalue) + +# Corresponding eigenvector +smallest_eigenvector <- eigenvectors[, which.min(eigenvalues)] +print(smallest_eigenvector) + +# Normalize the eigenvector by dividing by the largest absolute value +normalized_eigenvector <- smallest_eigenvector / max(abs(smallest_eigenvector)) +print(normalized_eigenvector) + +# Sorted normalize the eigenvector +sort(abs(normalized_eigenvector), decreasing = T) + +# Get numeric variable names (order matches eigenvector indices) +variable_names <- colnames(cm_delitos_data) + +# Sort normalized eigenvector by absolute contribution (descending order) +sorted_contributions <- sort(abs(normalized_eigenvector), decreasing = TRUE) + +# Get the indices of the top contributions +top_indices <- order(abs(normalized_eigenvector), decreasing = TRUE) + +# Get the names of the top variables +top_variable_names <- variable_names[top_indices] + +# Print the top variable names +print(top_variable_names) +``` + +Based on the eigenvector analysis of the covariance matrix, we identified that the crime variable sum_22SS has the highest absolute contribution to the eigenvector associated with the smallest eigenvalue. This indicates that sum_22SS lies along a direction in the data space with very low variance, meaning it contributes minimal unique information relative to the other crime variables. + +### Regression Analysis + +The regression model returns all coefficients, standard errors, and test statistics as zero or NaN (Not a Number). This indicates that the dependent variable $( \text{sum\_24TR} )$ is likely a linear combination of the independent variables or identical to their sum. Mathematically, if: + +$\text{sum\_24TR} = \sum_{i=1}^{n} \beta_i \cdot X_i$ + +where $( X_i )$ represents the independent variables, then the design matrix $( X )$ is rank deficient (i.e., singular), making it impossible to estimate unique regression coefficients. This leads to zero residuals and undefined test statistics. A possible cause is perfect multicollinearity, meaning the predictor variables are linearly dependent. In such cases, the system has no unique solution, and the regression fails to provide meaningful estimates. + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Fit a regression model to confirm the relationship + +model <- lm(sum_22TR ~ sum_22HA + sum_22EX + sum_22HM + + sum_22LP + sum_22HR + sum_22VI + + sum_22HC + sum_22HP + sum_22DS + sum_22SS + sum_22SE + sum_22HOM, + data = data.frame(delitos_data)) + +summary(model) +``` + +The linear regression results indicate that the coefficients for sum_22HM, sum_22LP, sum_22VI, sum_22HC, and sum_22HOM are statistically significant predictors of sum_22TR. This suggests that these variables have a meaningful relationship with terrorism incidents in the dataset. + +### Variance Inflation Factors (VIF) + +To confirm, we fit a regression model where one variable $y$ is explained by others: + +$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$ + +To quantify redundancy, we compute Variance Inflation Factors (VIFs) for each predictor $x_j$: + +$VIF_j = \frac{1}{1 - R_j^2}$ + +where $R_j^2$ is the $R^2$ value from regressing $x_j$ on all other predictors. + +- $VIF_j = 1$ → No multicollinearity.\ +- $VIF_j > 5$ → Moderate multicollinearity.\ +- $VIF_j > 10$ → Severe multicollinearity, indicating redundancy. + +A high VIF suggests that $x_j$ contributes little independent information and may be removed to improve model stability. + +```{r} +#| echo: true +#| message: false +#| warning: false +#| +# Variance Inflation Factors +vif(model) +``` + +The Variance Inflation Factors (VIFs) for all predictors are low, ranging between 1 and 1.4, indicating minimal multicollinearity among the independent variables. This strengthens confidence in the stability and interpretability of the estimated coefficients. + +## Global Variability Metric + +The effective variance and effective standard deviation are measures of the overall variability in the dataset. They are derived from the determinant of the covariance matrix, which captures the generalized variance of the data. For log-transformed data, these metrics are computed similarly but on the log-transformed covariance matrix. + +The effective variance is defined as: + +Effective Variance $= \det(\Sigma)^{\frac{1}{p}}$ + +where: + +- $( \Sigma )$ is the covariance matrix. +- $( p )$ is the number of variables. + +The effective standard deviation is given by: + +- Effective Standard Deviation $= \det(\Sigma)^{\frac{1}{2p}}$ + +For log-transformed data, the effective variance is computed as: + +- Log-Transformed Effective Variance $= \det(\log(\Sigma + 1))^{\frac{1}{p}}$ + +Similarly, the log-transformed effective standard deviation is: + +- Log-Transformed Effective Standard Deviation $= \det(\log(\Sigma + 1))^{\frac{1}{2p}}$ + +```{r} +#| echo: true +#| message: false +#| warning: false + +cov_matrix <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + select(-sum_22TR) %>% + cov() + +# Effective Variance +det(cov_matrix)^(1/ncol(cov_matrix)) + +# Log-Transformed Effective Variance +det(log(cov_matrix + 1))^(1/ncol(cov_matrix)) + +# Effective Standard Deviation +det(cov_matrix)^(1/(ncol(cov_matrix) * 2)) + +# Log-Transformed Effective Standard Deviation +det(log(cov_matrix + 1))^(1/(ncol(cov_matrix) * 2)) +``` + +These metrics provide a compact summary of how dispersed the multivariate crime data is across all variables simultaneously. The relatively low effective variance and standard deviation suggest that, when considering all crime variables together (excluding terrorism crimes), the data exhibits modest variability — likely reflecting many zeros or low counts in the dataset. + +## Linear Dependency and Precision Matrix + +Linear dependency in data occurs when some variables can be expressed as linear combinations of others, leading to redundancy. This is identified through the covariance matrix $( \Sigma )$ and its eigenvalues, where a near-zero eigenvalue indicates dependency. + +The precision matrix $( \Sigma^{-1} )$, the inverse of the covariance matrix, quantifies conditional dependencies. It highlights direct variable relationships, with zero entries indicating independence given other variables. These concepts are crucial for multicollinearity detection and improving model interpretability. + +Multicollinearity occurs when predictor variables are highly correlated, making it difficult to isolate their individual effects in a model. + +```{r} +#| echo: true +#| message: false +#| warning: false + +cov_matrix <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + cov() + +# Compute precision matrix +S_inv <- solve(cov_matrix) + +# Display precision matrix (should match example values) +cat("Precision Matrix (S⁻¹):\n") +print(S_inv, digits = 2) + +# Extract correct row components of the selected crime +dependent_variable_index <- 13 + +first_row <- S_inv[dependent_variable_index, ] +print(first_row, digits = 2) + +diag_element <- S_inv[dependent_variable_index, dependent_variable_index] +print(diag_element, digits = 2) + +# Compute regression coefficients +beta_coefficients <- -first_row[-dependent_variable_index] / diag_element +print(beta_coefficients, digits = 2) + +# Compute residual variance +residual_variance <- 1 / diag_element +residual_sd <- sqrt(residual_variance) # Residual standard error + +# Print residual standard error +print(residual_sd, digits = 2) + +# Compute R^2 +r_squared <- 1 - (residual_variance / cov_matrix[dependent_variable_index, dependent_variable_index]) +print(r_squared, digits = 2) + +# Verify with lm() regression +delitos <- delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + select(-sum_22VI, -sum_22HP, -sum_22DS, -sum_22SE) + +# Fit model +model <- lm(sum_22TR ~ ., data = data.frame(delitos)) +summary(model) +``` + +This global variability measure can help in assessing the overall complexity or “spread” in the data, which is important for further analyses like PCA, clustering, or regression modeling. + +Based on this, it is advisable to keep the variables sum_22HM, sum_22LP, sum_22HC, and sum_22HOM in the model as key explanatory variables for terrorism-related crime. + + +# Hands-on Data Analysis {.unnumbered} + +## Tackling a Critical Challenge: The Proliferation of Spatial Criminal Phenomena + +# Non-Parametric Correlation {.unnumbered} + +Correlation measures the strength and direction of association between two variables. While Pearson's correlation requires a linear relationship and normally distributed data, \emph{Spearman's rank correlation} and \emph{Kendall's tau} are \emph{non-parametric} measures, making them ideal for analyzing data that may not be linear or normally distributed. + +## Spearman's Rank Correlation + +Spearman's correlation coefficient $\rho$ is based on \emph{ranked} data. For two variables $X$ and $Y$, we replace each observation by its rank. + +$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$ + +to compute Spearman's correlation coefficient $\rho$. This measure is based on ranked data. For two variables $X$ and $Y$, we first replace each observation by its rank. Once the data are ranked, Spearman's $\rho$ is computed similarly to Pearson's correlation, but using the ranks instead of the raw values: + +$\rho = \frac{\sum_{i=1}^{n}(r_{X_i} - \bar{r}_{X})(r_{Y_i} - \bar{r}_{Y})}{\sqrt{\sum_{i=1}^{n}(r_{X_i} - \bar{r}_{X})^2}\sqrt{\sum_{i=1}^{n}(r_{Y_i} - \bar{r}_{Y})^2}}$ + +where $r_{X_i}$ is the rank of the $i$-th observation of $X$, $r_{Y_i}$ is the rank of the $i$-th observation of $Y$, and $\bar{r}_{X}$ and $\bar{r}_{Y}$ are the mean ranks of $X$ and $Y$, respectively. + +Spearman's $\rho$ is then computed similarly to Pearson's correlation but on these ranks: + +$\rho = \frac{\sum_{i=1}^{n}(r_{X_i} - \bar{r}_{X})(r_{Y_i} - \bar{r}_{Y})}{\sqrt{\sum_{i=1}^{n}(r_{X_i} - \bar{r}_{X})^2}\sqrt{\sum_{i=1}^{n}(r_{Y_i} - \bar{r}_{Y})^2}}$ + +where $r_{X_i}$ is the rank of the $i$-th observation of $X$, and $r_{Y_i}$ is the rank of the $i$-th observation of $Y$. + +- A value of $\rho$ close to 1 indicates a strong positive monotonic relationship (as one variable increases, so does the other), while a value close to -1 indicates a strong negative monotonic relationship. A value around 0 suggests little or no monotonic association. + +- This method is robust to non-normality and outliers since it relies on the order (ranks) rather than the actual values. + + +```{r} +#| echo: true +#| message: false +#| warning: false + +#delitos_data <- delitos_data %>% +# select(-sum_22TR) + +delitos_data %>% + st_drop_geometry() %>% + select(contains('22')) %>% + cor(., method = "spearman", use = "complete.obs") %>% + round(., 3) %>% + print(.) %>% + corrplot(., method = "color", title = "Spearman Correlation", mar=c(0,0,1,0)) +``` + +Absolute Spearman's $\rho$ values are less than 0.2, this show us low correlation. + +# Spatial Neighborhood Matrices {.unnumbered} + +This section is based on *Spatial statistics for data science theory and practice with R*. See [@Moraga2023]. + +## Neighbors Based on Contiguity + +- Queen Contiguity: Two polygons are considered neighbors if they share any common point (i.e., an edge or a vertex). Mathematically, if polygons $p_i$ and $p_j$ touch at any point, then $A\_{ij} = 1$. +- Rook Contiguity: Two polygons are neighbors only if they share a common edge. That is, if polygons $p_i$ and $p_j$ share a boundary segment, then $A\_{ij} = 1$; merely touching at a corner does not count. + + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Create a spatial neighbors list using Queen contiguity +# (i.e., polygons are considered neighbors if they share any point: edge or vertex) +nb <- spdep::poly2nb(delitos_data, queen = TRUE) +head(nb) + +# Replace invalid neighbor entries (i.e., [1] 0) with empty integer vectors +# This ensures compatibility with functions that expect valid neighbor lists only +nb_0 <- lapply(nb, function(x) if(length(x)==1 && x==0) integer(0) else x) + +# Polygons with neighbors +table(sapply(nb_0, length)) + +# Neighbors of Order k Based on Contiguity +# Neighbors of second order +nblags <- spdep::nblag(neighbours = nb, maxlag = 2) + +# Combine neighbors of all orders up to the specified lag (in this case, up to order 2) +# This creates a cumulative neighbor list including first- and second-order neighbors +nblagsc <- spdep::nblag_cumul(nblags) +table(sapply(nblagsc, length)) +``` + +## Neighbors Based on k Nearest Neighbors + +- K-Nearest Neighbors: For each polygon, the ( k ) nearest neighbors are identified based on a distance threshold. +- Distance Threshold: The distance threshold can be defined as a fixed value or as a function of the average distance between polygons. + +\textbf{k-Nearest Neighbors (kNN)} is a method that defines neighbors based on distance rather than contiguity. For each spatial unit $p_i$, the $k$ closest units (according to Euclidean distance or other metric) are selected as neighbors. + +Formally, let $D(p_i, p_j)$ be the distance between polygons $p_i$ and $p_j$. Then, the neighbor set $N_k(p_i)$ is defined as: + +$N_k(p_i) = p_j$ : $p_j$ is among the $k$ nearest polygons to $p_i$. + +This ensures that each polygon has exactly $k$ neighbors, which is useful when spatial units are irregular or disconnected. + + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Compute centroids of the polygons +coo <- st_centroid(delitos_data) + +# Create a neighbor list where each polygon (based on its centroid `coo`) is connected +# to its 3 nearest neighbors using k-nearest neighbors (k = 3) +nb <- knn2nb(knearneigh(coo, k = 3)) # k number nearest neighbors + +# Polygons with neighbors +table(sapply(nb, length)) + +# Subset data to the first 10 polygons +delitos_data_10 <- delitos_data[1:15000, ] + +# Recompute neighbor list for these 10 polygons to avoid index mismatches +nb_10 <- knn2nb(knearneigh(st_centroid(delitos_data_10), k = 3)) + +# Compute centroids for the 10 polygons +coords_10 <- st_coordinates(st_centroid(delitos_data_10)) + +# Plot the first 10 polygons and overlay neighbor connections in red +plot(st_geometry(delitos_data_10), border = "lightgray", main = "First Polygons with 3 Nearest Neighbors") +plot.nb(nb_10, coords_10, add = TRUE, col = "red", lwd = 2) +``` + +## Neighbors Based on Distance + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Create a neighbor list using distance-based contiguity: +# Polygons are considered neighbors if their centroids are within 0.4 units (e.g., degrees) apart +nb <- dnearneigh(x = st_centroid(delitos_data), d1 = 0, d2 = 0.4) + +# Polygons with neighbors +hist(sapply(nb, length)) + +# Subset data to the first 10 polygons +delitos_data_10 <- delitos_data[1:15000, ] + +# Recompute neighbor list for these 10 polygons to avoid index mismatches +nb_10 <- dnearneigh(x = st_centroid(delitos_data_10), d1 = 0, d2 = 0.4) + +# Compute centroids for the 10 polygons +coords_10 <- st_coordinates(st_centroid(delitos_data_10)) + +# Plot the first 10 polygons and overlay neighbor connections in red +plot(st_geometry(delitos_data_10), border = "lightgray", main = "First Polygons with 3 Nearest Neighbors") +plot.nb(nb_10, coords_10, add = TRUE, col = "red", lwd = 2) +``` + +Determining an Appropriate Upper Distance Bound: To ensure that each area in a spatial dataset has at least (k) neighbors, we can determine an appropriate upper distance bound by first computing the (k) nearest neighbors for each area. For example, using the Queen contiguity method, one may use the \textit{spdep::knearneigh()} function with (k=1) to obtain the nearest neighbor for each polygon. This yields a matrix of neighbor IDs, which is then converted into a neighbor list (of class \textit{nb}) via \textit{knn2nb()}. Next, the \textit{spdep::nbdists()} function computes the distances along the links between each area and its neighbor. By summarizing these distances (e.g., using \textit{summary(unlist(dist1))}), we can observe the range of distances. + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Compute k-nearest neighbors: for each polygon centroid, find its 1 nearest neighbor (k = 1) +nb1 <- knn2nb(knearneigh(coo, k = 1)) + +# Calculate the Euclidean distances between each polygon and its nearest neighbor +dist1 <- nbdists(nb1, coo) + +# Summarize all distances to understand the minimum, maximum, and quartiles +summary(unlist(dist1)) + +# Create a distance-based neighbor list: polygons whose centroids are within [0, 1.2] units are considered neighbors +nb <- dnearneigh(x = st_centroid(delitos_data), d1 = 0, d2 = 1.2) + +# Polygons with neighbors +hist(sapply(nb, length)) +``` + +## Neighborhood Matrices + +```{r} +#| echo: true +#| message: false +#| warning: false + +# Spatial weights matrix using Queen contiguity (binary weights) +# 'queen = TRUE' considers shared edges OR vertices as neighbors +nb <- poly2nb(delitos_data, queen = TRUE) + +# Convert the neighbor list to a spatial weights list object +# 'style = "W"' row-standardizes the weights (sums to 1) +# 'zero.policy = TRUE' avoids errors when some polygons have no neighbors +nbw <- spdep::nb2listw(nb, style = "W", zero.policy = TRUE) + +# Spatial weights matrix based on inverse distance values +# Compute centroids of polygons +coo <- st_centroid(delitos_data) + +# Use Queen contiguity again to define neighbors +nb <- poly2nb(delitos_data, queen = TRUE) + +# Compute distances between neighbors based on their centroids +dists <- nbdists(nb, coo) + +# Create inverse distance weights (1/distance) for each pair of neighbors +ids <- lapply(dists, function(x){1/x}) + +# Create a listw object using binary style ("B" = no standardization) +nbw <- nb2listw(nb, glist = ids, style = "B", zero.policy = TRUE) +``` + +# Spatial autocorrelation {.unnumbered} + +## Global Moran’s + +Spatial Weights Matrix (W) + +The \textit{nb2listw(nb, style = "W")} function calculates the spatial weights matrix, often denoted as $W$. This matrix defines the spatial relationships between the polygons. The \textit{style = "W"} argument specifies row-standardization. + +- Let $w_{ij}$ be an element of the matrix $W$. +- If polygon $i$ is a neighbor of polygon $j$, then $w_{ij} = \frac{1}{n_i}$, where $n_i$ is the number of neighbors of polygon $i$. +- If polygon $i$ is not a neighbor of polygon $j$, then $w_{ij} = 0$. +- The diagonal elements, $w_{ii}$, are typically 0 (a polygon is not considered a neighbor of itself). + +In essence, each row of the matrix $W$ represents a polygon, and the entries in that row represent the influence of its neighbors. Row-standardization means that the elements in each row sum to 1. + +Global Moran's I Statistic: \textit{moran.test()} function calculates Moran's I, a measure of global spatial autocorrelation. The formula for Moran's I is: + +$I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}) \sum_{i=1}^{n} (x_i - \bar{x})^2}$ + +Where: + +- $n$ is the number of observations (polygons). +- $x_i$ is the value of the variable of interest (in your case, \texttt{delitos\_data\$sum\_22TR}) for polygon $i$. +- $\bar{x}$ is the mean of the variable $x$. +- $w_{ij}$ is the spatial weight between polygon $i$ and polygon $j$ from the matrix $W$. + +Simplified: + +$I = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{S^2 \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}$ + +Where + +$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$ + +Moran's I essentially measures the correlation between the values at a location and the values at neighboring locations. + +Interpretation of Moran's I: + +- $I$ ranges from -1 to +1. +- $I > 0$: Positive spatial autocorrelation. Similar values tend to cluster together. +- $I < 0$: Negative spatial autocorrelation. Dissimilar values tend to cluster together. +- $I \approx 0$: Random spatial pattern. + +Hypothesis Testing: + +The \textit{moran.test()} function also performs a hypothesis test to assess the statistical significance of the observed spatial pattern. + +- \textit{Null Hypothesis ($H_0$):} The variable is randomly distributed in space (no spatial autocorrelation). +- \textif{Alternative Hypothesis ($H_a$):} There is spatial autocorrelation (you specified \textit{alternative = "greater"}, so it's testing for \textit{positive} spatial autocorrelation). + +Output of \textit{moran.test()} + +- \textit{gmoran[["estimate"]][["Moran I statistic"]]}: The calculated value of Moran's I. +- \textit{gmoran[["statistic"]]}: The z-score, which measures how far the observed Moran's I is from the expected value under the null hypothesis, in standard deviations. +- \textit{gmoran[["p.value"]]}: The p-value, which is the probability of observing a Moran's + +`I` value as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true. A small p-value (typically less than 0.05) suggests that you can reject the null hypothesis and conclude that there is statistically significant spatial autocorrelation. + +```{r} +# Compute centroids of the polygons +coo <- st_centroid(delitos_data) + +# Create a neighbor list where each polygon (based on its centroid `coo`) is connected +# to its 3 nearest neighbors using k-nearest neighbors (k = 3) +nb <- knn2nb(knearneigh(coo, k = 3)) # k number nearest neighbors + +# Global Moran's I +# Convert the neighbor list to a listw object +lw <- nb2listw(nb, style = "W") # Use nb2listw + +# Now you can use 'lw' in moran.test +gmoran <- moran.test(delitos_data$sum_22TR, lw, alternative = "greater") + +gmoran +``` + +Terrorism shows a statistically significant tendency to cluster geographically.However, the magnitude of the autocorrelation is small (Moran's I = 0.01), suggesting the clustering effect, while present, is weak. + +The neighbour object has 206 sub-graphs indicating that the spatial neighborhood structure is fragmented — i.e., the network of areas has disconnected clusters (some areas have no mutual neighbors). This might affect the strength and interpretation of spatial relationships. + + +```{r} +gmoran[["estimate"]][["Moran I statistic"]] # Moran's I + +gmoran[["statistic"]] # z-score + +gmoran[["p.value"]] # p-value +``` +Moran's I value: It measures global spatial autocorrelation — how similar or dissimilar nearby values are, compared to a random distribution. Value of 0.0101 shows that there is a very slight positive spatial autocorrelation — areas with similar levels of terrorism incidents are somewhat near each other. + + +Since p < 0.05, the result is statistically significant at the 5% level. This means we can reject the null hypothesis of spatial randomness. There is evidence that terrorism incidents are not randomly distributed, but instead show a mild spatial clustering. + +\textit{Moran's I Monte Carlo Simulation:} The code you provided performs a Monte Carlo simulation to assess the significance of Moran's I. Here's a breakdown: + +Moran's I Monte Carlo Test: The function `moran.mc` performs a Monte Carlo simulation of Moran's I. Instead of relying on the theoretical distribution of Moran's I (which can be complex), it generates a set of random spatial patterns to create an empirical distribution. + +```{r} +#| echo: true +#| message: false +#| warning: false +gmoranMC <- moran.mc(delitos_data$sum_22TR, lw, nsim = 99) +gmoranMC + +hist(gmoranMC$res) +abline(v = gmoranMC$statistic, col = "red") +``` + +```{r} +#| echo: true +#| message: false +#| warning: false + +moran.plot(delitos_data$sum_22TR, lw) +``` + +Monte Carlo simulation confirms our previous results on a true but low positive spatial correlation. + +## Local Moran’s I + +```{r} +#| echo: true +#| message: false +#| warning: false + +lmoran <- localmoran(delitos_data$sum_22TR, lw, alternative = "greater") +head(lmoran) +``` +Ii ≈ 0.00164 → very small positive local autocorrelation. + +Z.Ii ≈ 0.07025 → very low z-scores. + +P-value ≈ 0.472 → not statistically significant. + + +