We've gotten some initial feedback from alpha users on the demo (https://github.com/LineaLabs/lineapy/blob/main/examples/Demo_1_Preprocessing.ipynb), regarding these lines:
... # some work omitted
cleaned_data.filter(
regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
).to_csv("outputs/cleaned_data_housing.csv", index=False)
artifact = lineapy.save(lineapy.file_system, "cleaned_data_housing")

Specifically, lineapy.file_system is not intuitive.
I speculate that this is because when the user is thinking about saving the "cleaned_data_housing.csv" from cleaned_data, they have to map that activity to a new concept, lineapy.file_system.
We initially went with lineapy.file_system because it's the most technically succinct way of describing the desired capture mechanism: via side-effects (this is similar to the requirement for asserts #449). Note the semantic difference as compared to lineapy.save(cleaned_data, ....), where we save the value of cleaned_data, and the slice would end with the final line that last changed cleaned_data (specifically in the notebook, cleaned_data = cleaned_data.drop(columns=Neighborhood_cats[0])). If we were to slice this into Airflow, the job would be a no-op since no change is persisted or passed to another job.
There are a few different options; all of them will require some form of additional annotation beyond lineapy.save(cleaned_data, ....).
1. lineapy.save(cleaned_data, include_side_effects=True). Here, we'll include all the code that has side effects and uses cleaned_data.
   - We can even set include_side_effects to True by default to further reduce friction for this use case. It's hard to imagine when the user would want to exclude side effects. Though one scenario I can imagine is writing a sample to disk, and then writing the whole thing to a SQL database (but it's unlikely?).
2. Another option is to have a different API that saves the actual value of something (and implicitly extracts out a process) vs. something that just extracts the process. So something like lineapy.save_pipeline(cleaned_data, "") and lineapy.save_value(cleaned_data, ""). My sense is that this option might be too indirect and confusing.
   - The benefit here would be that the .get would be clearer: we cannot do a .get on side effects but can on saved values.
3. We can also save the call, to_csv, and we can do so as a decorator to the line (@linea.save). But I'm not a fan because the user would have to do the decoration before the invocation of the call, which doesn't go with our philosophy of deployment after the initial invocations.
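
To make the first option above more concrete, here is a minimal toy sketch of how a slice could differ with and without side effects. It does not use lineapy and says nothing about how lineapy actually slices: the Statement record, the slice_for function, and the load_and_clean() placeholder are all hypothetical names made up for illustration, and the three statements are a simplified stand-in for the demo notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One traced notebook line (hypothetical, heavily simplified model)."""
    code: str
    writes: set = field(default_factory=set)  # variables assigned by this line
    reads: set = field(default_factory=set)   # variables read by this line
    side_effect: bool = False                 # True if the line writes outside the program


def slice_for(statements, target, include_side_effects=False):
    """Return the code lines needed for `target`, in program order."""
    needed = {target}
    kept = set()
    if include_side_effects:
        # Keep side-effecting lines that read the target directly.
        for i, stmt in enumerate(statements):
            if stmt.side_effect and target in stmt.reads:
                kept.add(i)
                needed |= stmt.reads
    # Walk backwards and keep every line that writes a needed variable.
    for i in reversed(range(len(statements))):
        stmt = statements[i]
        if stmt.writes & needed:
            kept.add(i)
            needed |= stmt.reads
    return [statements[i].code for i in sorted(kept)]


notebook = [
    # load_and_clean() is a made-up placeholder for the omitted work.
    Statement("cleaned_data = load_and_clean()", writes={"cleaned_data"}),
    Statement(
        "cleaned_data = cleaned_data.drop(columns=Neighborhood_cats[0])",
        writes={"cleaned_data"},
        reads={"cleaned_data", "Neighborhood_cats"},
    ),
    Statement(
        'cleaned_data.filter(...).to_csv("outputs/cleaned_data_housing.csv", index=False)',
        reads={"cleaned_data"},
        side_effect=True,
    ),
]

value_only = slice_for(notebook, "cleaned_data")
with_effects = slice_for(notebook, "cleaned_data", include_side_effects=True)
```

In this toy model, value_only ends at the drop(...) line and omits the to_csv call, which is the no-op situation described above, while with_effects also keeps the file write.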
Another consideration is fine-grained addressability: the downside of options 1 and 2 is that they don't allow more fine-grained side-effect slicing, whereas with lineapy.file_system, we can change it to something like lineapy.file_system("outputs/cleaned_data_housing.csv") or lineapy.file_system(cleaned_data) to be more fine-grained. Option 3 allows the fine-grained annotation directly.
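
As a rough illustration of what fine-grained addressing could mean, the sketch below filters recorded file writes by path or by source variable. This is only a toy model under stated assumptions, not how lineapy.file_system works: FileWrite, select_writes, and the second write to outputs/sample.csv are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class FileWrite:
    """A recorded file-writing side effect (hypothetical model)."""
    code: str
    path: str
    source_var: str


def select_writes(writes, path=None, source_var=None):
    """Filter file writes; with no filters, every write is selected."""
    return [
        w for w in writes
        if (path is None or w.path == path)
        and (source_var is None or w.source_var == source_var)
    ]


writes = [
    FileWrite(
        code='cleaned_data.filter(...).to_csv("outputs/cleaned_data_housing.csv", index=False)',
        path="outputs/cleaned_data_housing.csv",
        source_var="cleaned_data",
    ),
    # Made-up second write, so there is something to filter out.
    FileWrite(
        code='sample.to_csv("outputs/sample.csv", index=False)',
        path="outputs/sample.csv",
        source_var="sample",
    ),
]

everything = select_writes(writes)  # coarse: all file writes
by_path = select_writes(writes, path="outputs/cleaned_data_housing.csv")
by_variable = select_writes(writes, source_var="cleaned_data")
```

Here the unfiltered call corresponds to the coarse "everything written to the file system" reading, while the path and variable filters correspond to the two finer-grained forms suggested above.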
These are just initial thoughts, please help brainstorm! cc @dorx per your request.