Exploration of more intuitive `.save` designs.

We've gotten some initial feedback from alpha users that in the demo (https://github.com/LineaLabs/lineapy/blob/main/examples/Demo_1_Preprocessing.ipynb), the line

```python

... # some work omitted

cleaned_data.filter(
    regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
).to_csv("outputs/cleaned_data_housing.csv", index=False)

artifact = lineapy.save(lineapy.file_system, "cleaned_data_housing")
```

Specifically, `lineapy.file_system` is not intuitive. 

I speculate that this is because when the user is thinking about saving the "cleaned_data_housing.csv" from `cleaned_data`, they have to map that activity to a new concept, `lineapy.file_system`.

We initially went with `lineapy.file_system` because it's the most technically succinct way of describing the desired capture mechanism: via side-effects (this is similar to the requirement for asserts https://github.com/LineaLabs/lineapy/issues/449). Note the semantic difference as compared to `lineapy.save(cleaned_data, ....)`, where we save the value of `cleaned_data`, and the slice would end with the final line that last changed `cleaned_data` (specifically in the notebook, `cleaned_data = cleaned_data.drop(columns=Neighborhood_cats[0])`). If we were to slice this into Airflow, the job would be a no-op since no change is persisted or passed to another job.

There are a few different options---all of them will require some form of additional annotation _beyond_ `lineapy.save(cleaned_data, ....)`.

1. `lineapy.save(cleaned_data, include_side_effecs=True)`. Here, we'll include all the code that has side_effects that uses `cleaned_data`.
  * We can even set `include_side_effecs` to `True` by default to further reduce friction for this use case. It's hard to imagine when the user would want to _exclude_ side effects. Though one scenario I can imagine is writing a sample to disk, and then writing the whole thing to a SQL database (but it's unlikely?).
2. Another option is to have a different API that saves the actual value of a something (and implicitly extracts out a process) vs. something that _just_ extracts the process. So something like `lineapy.save_pipeline(cleaned_data, "")` and `lineapy.save_value(cleaned_data, "")`. My sense is that this option might be too indirect and confusing.
  * The benefit here would be that the `.get` would be clearer---we cannot do a `.get` on side effects but can on saved values.
3. We can also save the call, `to_csv`, and we can do so as a decorator to the line (@linea.save). But I'm not a fan because the user would have do the decoration _before_ the invocation of the call, which doesn't go with our philosophy of deployment after the initial invocations.

Another consideration is fine-grained addressability: the downside of options 1 and 2 is that they don't allow us to now do more fine grained side_effect slicing, where as with `lineapy.file_system`, we can change it to something like `lineapy.file_system("outputs/cleaned_data_housing.csv")` or `lineapy.file_system(cleaned_data)` to be more fine grained. Option 3 allows the fine grained annotation directly.

These are just initial thoughts, please help brainstorm! cc @dorx per your request.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exploration of more intuitive `.save` designs. #478

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Exploration of more intuitive .save designs. #478

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Exploration of more intuitive `.save` designs. #478