Conversation
This reverts commit d541aeb.
jeanfeydy
left a comment
Fine! The KNN code is now 99% complete, but there remains quite a lot of work to be done on the Nystroem side. We'll talk about it tomorrow: see you soon :-)
pykeops/common/nystrom_generic.py
Outdated
GenericLazyTensor = TypeVar("GenericLazyTensor")


class GenericNystrom:
Following the scikit-learn convention, we should probably call everything "Nystroem" instead of "Nystrom" (from classes to file names).
pykeops/common/nystrom_generic.py
Outdated
n_components = how many samples to select from data.
kernel = type of kernel to use. Current options = {rbf:Gaussian,
    exp: exponential}.
sigma = exponential constant for the RBF and exponential kernels.
eps = size for square bins in block-sparse preprocessing.
k_means = number of centroids for KMeans algorithm in block-sparse
    preprocessing.
n_iter = number of iterations for KMeans.
dtype = type of data: np.float32 or np.float64
inv_eps = additive invertibility constant for matrix decomposition.
verbose = set True to print details.
random_state = to set a random seed for the random sampling of the samples.
    To be used when reproducibility is needed.
Please note that we follow the Google docstring convention, which is documented here.
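For reference, a minimal sketch of what the parameter list above could look like in Google style; the abridged signature, types and defaults are assumptions:

class GenericNystroem:
    def __init__(self, n_components=100, kernel="rbf", sigma=None, random_state=None):
        """Generic Nystroem approximation of a kernel matrix.

        Args:
            n_components (int): Number of samples to select from the data.
            kernel (str): Kernel type, either "rbf" (Gaussian) or "exp"
                (exponential).
            sigma (float, optional): Bandwidth of the RBF and exponential
                kernels.
            random_state (int, optional): Seed for the random sampling of
                the samples; set it when reproducibility is needed.
        """
        self.n_components = n_components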
pykeops/common/nystrom_generic.py
Outdated
self.tools = None
self.LazyTensor = None

self.device = "cuda" if pykeops.config.gpu_available else "cpu"
Shouldn't we use the device that is used originally by the input data? I don't know how self.device = "cuda" would handle multi-GPU configurations.
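A minimal sketch of that idea, assuming the inputs are torch tensors (the exact attribute handling is an assumption):

# Inherit the device from the input data instead of hard-coding "cuda",
# so multi-GPU inputs stay on the device they were allocated on:
self.device = x.device if hasattr(x, "device") else "cpu"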
pykeops/common/nystrom_generic.py
Outdated
self.verbose = verbose
self.random_state = random_state
self.tools = None
self.LazyTensor = None
As detailed in the previous review, I think that using self.LazyTensor as an attribute name is a bit dangerous. You could use e.g. self.lazy_tensor instead?
pykeops/common/nystrom_generic.py
Outdated
# Set default sigma
# if self.sigma is None and self.kernel == 'rbf':
if self.sigma is None:
    self.sigma = np.sqrt(x.shape[1])
Is this the scikit-learn default? I would have expected a set value, or something like 10% of the diameter of the data.
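A sketch of the second option, using the bounding-box diagonal as a cheap proxy for the data diameter (both the proxy and the 0.1 factor are assumptions):

import numpy as np

if self.sigma is None:
    # Diagonal of the bounding box, a cheap upper bound on the diameter:
    diameter = np.linalg.norm(x.max(axis=0) - x.min(axis=0))
    self.sigma = 0.1 * diameter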
self.K_nq = K_nq  # dim: number of samples x num_components
self.K_nq.backend = "GPU_2D"
self.normalization = normalization
As with NumPy, "GPU_2D" should only be applied on the "transposed" term.
x_LT = LazyTensor(x.unsqueeze(0).to(device))
y_LT = LazyTensor(y.unsqueeze(1).to(device))
d = ((x_LT - y_LT) ** 2).sum(-1)
true_nn = d.argKmin(K=k, dim=1).long()
For the sake of consistency, it would be better to use the names x_i, y_j, D_ij, indices and add the expected array shapes in the comments.
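A minimal sketch of that convention, in the usual KeOps tutorial layout (which of x and y plays the query role here is an assumption):

x_i = LazyTensor(x.unsqueeze(1).to(device))  # (N, 1, D) query points
y_j = LazyTensor(y.unsqueeze(0).to(device))  # (1, M, D) reference points
D_ij = ((x_i - y_j) ** 2).sum(-1)            # (N, M) symbolic squared distances
indices = D_ij.argKmin(K=k, dim=1).long()    # (N, k) k nearest neighbors of each x_i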
def accuracy(indices_test, indices_truth):
    """
    Compares the test and ground truth indices (one row of KNN indices per
    point in the dataset). Returns the proportion of correct nearest neighbours.
    """
    N, k = indices_test.shape

    # Count the correct nearest neighbours: matching indices may sit at
    # different positions within a row, so compare against k rolled copies
    # of the ground truth; each true match is counted exactly once.
    accuracy = 0
    for i in range(k):
        accuracy += torch.sum(indices_test == indices_truth).float() / N
        indices_truth = torch.roll(indices_truth, 1, -1)
    accuracy = float(accuracy / k)  # fraction of correct neighbours, in [0, 1]

    return accuracy
Since we use this function in several tutorials, shouldn't we factor it somewhere in the utils module?
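After the move, tutorials could then simply import it; the exact module path is an assumption:

from pykeops.torch.utils import accuracy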
x_LT = LazyTensor(x.unsqueeze(0).to(device))
y_LT = LazyTensor(y.unsqueeze(1).to(device))
d = ((x_LT - y_LT) ** 2).sum(-1)
true_nn = d.argKmin(K=k, dim=1).long()
Since this code is used above, shouldn't we factor it in a function to avoid copy-pastes?
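One possible factorization, reusing the naming convention suggested above (the helper's name and signature are assumptions):

def ground_truth_knn(x, y, k, device):
    """Brute-force KNN of each row of x among the rows of y, via KeOps."""
    x_i = LazyTensor(x.unsqueeze(1).to(device))  # (N, 1, D)
    y_j = LazyTensor(y.unsqueeze(0).to(device))  # (1, M, D)
    D_ij = ((x_i - y_j) ** 2).sum(-1)            # (N, M)
    return D_ij.argKmin(K=k, dim=1).long()       # (N, k)

true_nn = ground_truth_knn(x, y, k, device)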
########################################################################
# NNDescent search using clusters and Manhattan distance
# Second experiment with N=$10^6$ points in dimension D=3, with 5 nearest
# neighbors and Manhattan distance.
Couldn't we factor the timings/display code above in a function, and call it again with a different metric function instead of copy-pasting it below?
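A rough sketch of such a factorization, parameterized by the symbolic metric (every name below is an assumption):

import time
from pykeops.torch import LazyTensor

def benchmark_knn(x, y, k, metric, label=""):
    """Times a brute-force KeOps KNN search for a given symbolic metric."""
    x_i = LazyTensor(x.unsqueeze(1))  # (N, 1, D)
    y_j = LazyTensor(y.unsqueeze(0))  # (1, M, D)
    start = time.time()
    D_ij = metric(x_i, y_j)                    # (N, M) symbolic distances
    indices = D_ij.argKmin(K=k, dim=1).long()  # (N, k)
    print(f"{label}: {time.time() - start:.3f}s")
    return indices

# Called once per experiment, e.g. for the Manhattan run:
# benchmark_knn(x, y, 5, lambda a, b: (a - b).abs().sum(-1), label="Manhattan")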
added accuracy to torch tools
shifted accuracy to torch.utils
factorized timing function
First request