From de8f9bddcbce0e657c80504f3273b9fd27ced30c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A1s=20Zlinszky?=
Date: Thu, 5 Mar 2026 14:27:30 +0100
Subject: [PATCH] more detailed markdown and timestamp listing

---
 geo/cdse_embeddings_CF.ipynb | 39 +++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/geo/cdse_embeddings_CF.ipynb b/geo/cdse_embeddings_CF.ipynb
index 5848a85..786a0bc 100644
--- a/geo/cdse_embeddings_CF.ipynb
+++ b/geo/cdse_embeddings_CF.ipynb
@@ -15,9 +15,14 @@
    "source": [
     "

Introduction

\n", "\n", - "This notebook presents the use of embeddings from the MajorTom dataset. They were extracted from Sentinel-2 L1C satellite imagery. \n", + "Embeddings are a way to compress the information content in multi-dimensional data using advanced machine learning. The model generates embeddings by analysing the patterns in a spatial neighbourhood of multispectral imagery, and reducing them to a vector for each patch of pixels. This notebook presents the use of embeddings from the MajorTom dataset. They were extracted from Sentinel-2 L1C satellite imagery. \n", "These embeddings can be used for tasks like classification, regression and change detection. The example demonstrates how embeddings facilitate \n", - "efficient processing of large satellite datasets" + "efficient processing of large satellite datasets.\n", + "\n", + "This example notebook runs on the [Core-S2L1C-SSL4EO embedding dataset](https://documentation.dataspace.copernicus.eu/Data/ComplementaryData/Embeddings.html), which uses multi-spectral Sentinel-2 Level-1 C data and the SSL4EO-ResNet50-DINO model. It has global coverage for a single point in time for each tile – timestamp available in the metadata.\n", + "In the notebook, you first set up access to the embeddings via S3, define an AOI and load the corresponding GeoParquet files to memory. Then you build an embedding matrix and compute cosine similarity between the individual tiles, splitting the patches into groups based on their similarity, finding the most similar patches within the clusters, and finally visualizing those patches together to support interpretation of the groups.\n", + "\n", + "In order to run this notebook, you need to select a large server when you open the CDSE Jupyter lab. If you are currently in a small or medium server, you can change by selecting `File/Hub Control Panel` and clicking `Stop my server`. 
Wait a few seconds until the server stops, then click `Start my server` and select the large server.\n"
    ]
   },
   {
@@ -151,7 +156,9 @@
     "\n",
     "You should configure two variables:\n",
     "- AWS_ACCESS_KEY_ID\n",
-    "- AWS_SECRET_ACCESS_KEY"
+    "- AWS_SECRET_ACCESS_KEY\n",
+    "\n",
+    "Once you have these keys, uncomment the second and third lines below (delete the \"#\") and enter your keys within the quotation marks in place of `YOUR ACCESS` and `YOUR KEY`. Now your credentials are configured as variables within the notebook."
    ]
   },
   {
@@ -174,7 +181,8 @@
     "

Define your AOI

\n", "\n", "Defining a bounding box and searching for them in a parquet file\n", - "In this example, the data covers the area around Girona. The task is to classify the land into three categories:\n", + "In this example, the data covers the area around Girona. The images within the dataset are from different dates, we will query and print the timestamps to show you when they were collected.\n", + "The task is to classify the land into three categories:\n", "forests, highly urbanized areas, low urbanization/farmland." ] }, @@ -201,9 +209,17 @@ "lat_min, lat_max = 41.9, 42.1" ] }, + { + "cell_type": "markdown", + "id": "7619f3ac", + "metadata": {}, + "source": [ + "In the next cell, you will again need to uncomment the lines starting with `# aws_access_key_id` and `# aws_secret_access_key`. Take care, you will need to do this twice, once in the first block where we set the S3 client and once in the fourth block where we set the File System. You do not need to paste in your credentials again." + ] + }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "id": "8f55e6df", "metadata": {}, "outputs": [ @@ -257,7 +273,7 @@ " & (ds.field(\"centre_lat\") >= lat_min)\n", " & (ds.field(\"centre_lat\") <= lat_max)\n", " ),\n", - " columns=[\"grid_cell\", \"centre_lon\", \"centre_lat\"],\n", + " columns=[\"grid_cell\", \"centre_lon\", \"centre_lat\", \"timestamp\"],\n", " )\n", "\n", " if table.num_rows > 0:\n", @@ -271,6 +287,11 @@ " df_result = df_result.drop_duplicates()\n", " print(\"Found data\")\n", "\n", + " # Print a summary of the timestamps found in this Area of Interest\n", + " print(\"\\nTimestamps found in this AOI (Date and number of patches):\")\n", + " # Using value_counts() gives a nice summary if there are multiple dates\n", + " print(df_result[\"timestamp\"].value_counts())\n", + "\n", "else:\n", " print(\"No grid cells were found in the bounding box\")" ] @@ -319,7 +340,7 @@ "source": [ "

Loading filtered parquets

\n", " \n", - "After filtering the relevant Parquet files, we now need to load them from S3" + "After filtering the relevant Parquet files, we now need to load them from S3. Remember to uncomment the lines with `# aws_access_key_id` and `# aws_secret_access_key` again." ] }, { @@ -630,7 +651,7 @@ "\n", "Steps:\n", "\n", - "1. Identifying the correct S3 folder for each parquet file and loading thumbnails for the grid cells\n", + "1. Identifying the correct S3 folder for each parquet file and loading thumbnails for the grid cells. Remember to uncomment the lines with `# aws_access_key_id` and `# aws_secret_access_key` again.\n", "\n", "2. Selecting the top pairs per cluster with the highest cosine similarity\n", "\n", @@ -928,7 +949,7 @@ "\n", "Steps:\n", "\n", - "1. Finding the correct S3 folder for each parquet file and load available thumbnails\n", + "1. Finding the correct S3 folder for each parquet file and load available thumbnails. Remember to uncomment the lines with `# aws_access_key_id` and `# aws_secret_access_key` again.\n", "\n", "2. Picking a few random examples from each cluster and crop the images to the specified area\n", "\n",