Add option for embedding generation upon BioImage Ingestion #482
Conversation
This pull request has been linked to Shortcut Story #33097: Add option for embedding generation upon ingest.
```python
from .helpers import batch
from .helpers import get_embeddings_uris
from .helpers import scale_calc
from .helpers import serialize_filter
```
These helpers don't seem like they would be public-facing.
```python
def get_embeddings_uris(output_file_uri: str) -> Tuple[str, str]:
    destination = os.path.dirname(output_file_uri)
    filename = os.path.basename(output_file_uri).split(".")
```
Check out `os.path.splitext`.
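For reference, a short sketch of the difference between `str.split(".")` and `os.path.splitext` (the URI value here is a made-up example, not one from the PR):

```python
import os

# Hypothetical output URI to illustrate the difference
output_file_uri = "s3://bucket/images/sample.ome.tiff"
filename = os.path.basename(output_file_uri)

# str.split(".") fragments the name at every dot:
print(filename.split("."))  # ['sample', 'ome', 'tiff']

# os.path.splitext separates only the final extension,
# keeping the rest of the name intact:
stem, ext = os.path.splitext(filename)
print(stem, ext)  # sample.ome .tiff
```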
```python
filtered = split_in_patches(image, embedding_level, embedding_grid)

patches_array = np.array([])
for patch in filtered:
```
```python
region_width = level_shape_w // grid_col_num
# Loop through the image and extract each region
patches = []
for i in range(grid_row_num):
```
This seems like it is going to be very slow?
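As a possible speed-up, the grid extraction can be done with a single `reshape`/`transpose` instead of a Python-level double loop. A sketch, assuming a 2-D image whose dimensions are cropped to a multiple of the patch size (the function name is illustrative, not the PR's actual helper):

```python
import numpy as np

def split_in_patches_vectorized(image, grid_row_num, grid_col_num):
    """Split a 2-D image into a grid of equal patches without looping."""
    h, w = image.shape
    rh, cw = h // grid_row_num, w // grid_col_num
    # Crop to a multiple of the patch size, then rearrange axes so each
    # (row, col) grid cell becomes its own (rh, cw) patch.
    cropped = image[: rh * grid_row_num, : cw * grid_col_num]
    patches = (cropped
               .reshape(grid_row_num, rh, grid_col_num, cw)
               .transpose(0, 2, 1, 3))
    return patches.reshape(-1, rh, cw)

img = np.arange(36).reshape(6, 6)
print(split_in_patches_vectorized(img, 2, 3).shape)  # (6, 3, 2)
```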
```python
np.array(embeddings.shape, dtype="uint32").tofile(f)
np.array(embeddings).astype("float32").tofile(f)
```
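For context, the two `tofile` calls above produce a raw binary file: a shape header as `uint32` followed by the flattened `float32` payload. A minimal round-trip sketch (the array contents and temporary file are illustrative):

```python
import tempfile
import numpy as np

# Hypothetical embeddings array standing in for the model output
embeddings = np.random.rand(4, 128).astype("float32")

with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    # Header: the array shape as uint32, then the float32 payload
    np.array(embeddings.shape, dtype="uint32").tofile(f)
    embeddings.astype("float32").tofile(f)
    path = f.name

# Reading it back: recover the shape first, then reshape the payload
with open(path, "rb") as f:
    shape = np.fromfile(f, dtype="uint32", count=2)
    restored = np.fromfile(f, dtype="float32").reshape(shape)

print(np.allclose(embeddings, restored))  # True
```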
```python
vs.ingest(
```
Can you explain a bit about the use case that you want to support?
From what I understand, this creates one vector index per image, storing the embeddings of different patches of the image. Does this mean that you want to find similar patches within the image? Do you want to do any cross-image queries?
The idea here is to find similar patches across multiple images. We don't have any specific customer request for any of this. The reasoning behind the patches is that in this data environment it is quite unlikely to find similarities across whole images: each image is quite different from the others as a whole, depending on the depicted cell. Also, to my knowledge, the blank space outside the cells captured by the sensor can introduce a lot of noise into the model if the image is fed to it whole. So my thought was that a future user would want to search across images for similarities to a specific region of the query image, or, as a next step, to query a region from the viewer and find similar "abnormalities" (which they can justify and I can't) across the images.
```python
vs.ingest(
    index_type="FLAT",
    index_uri=embeddings_flat_uri,
```
You also need to pass the `source_uri` of the input vectors (`tmp_features_file`).
This PR:
- With the `types` abstraction we are able to enhance this functionality in the future.
- `SupportedExtensions` has been added concerning the file extensions we support with our ingestion function.
- `fmt_version = 3`

The resulting image is shown below:

Building a local Docker image of the UDFs, I was able to test it; the results are shown below as well:

This PR should be followed by PRs that will do the following (they will be linked here upon creation):
- `python-udf-imaging` dockerfile in TileDB-REST-UDF-DOCKER-IMAGES with the needed dependencies, such as `tiledb-vector-search` and `tensorflow`