Skip to content

Multi session support#673

Draft
SimonHeybrock wants to merge 77 commits intomainfrom
multi-session
Draft

Multi session support#673
SimonHeybrock wants to merge 77 commits intomainfrom
multi-session

Conversation

@SimonHeybrock
Copy link
Member

@SimonHeybrock SimonHeybrock commented Jan 23, 2026

Summary

Adds proper multi-session support to the dashboard so multiple browser tabs/users can view plots simultaneously without interference.

Fixes #609.
Fixes #672.

Architecture changes

  • Two-stage plotter pattern: Plotters now separate compute() (shared, runs in background thread) from present() (per-session, creates Pipe/DynamicMap). This avoids sharing Panel/HoloViews streaming objects across sessions.
  • Explicit layer state machine: LayerStateMachine tracks each plot layer through well-defined states (WAITING_FOR_JOB → WAITING_FOR_DATA → READY → STOPPED/ERROR) with validated transitions.
  • Session infrastructure: SessionRegistry tracks active browser sessions with heartbeat-based cleanup (but see Add session status display to System Status tab and fix stale-session detection #685 for fixes for this mechanism); SessionUpdater handles per-session periodic updates.
  • PlotDataService: Central service for plot layer state that coordinates between background data processing and per-session presentation.

Key learnings

  • pn.state.notification must be used within a session context. Calling it from background threads does not work.
  • Sharing Pipe and DynamicMap between sessions does work if Pipe.send is called from a background thread, but see next point.
  • pn.io.hold() must be used within a session context. If a background thread uses Pipe.send, all sessions see updates but there is no way to update all plots within a session in a single batched step.
  • Therefore, each session needs its own periodic callback that pulls data from the background thread, uses pn.io.hold(), and calls Pipe.send on its own Pipe and DynamicMap.

SimonHeybrock and others added 30 commits January 22, 2026 06:02
Reduced from ~1300 lines to ~280 lines by:
- Removing redundant explanations (Plotter/Presenter was explained 3x)
- Replacing full implementations with interface signatures
- Cutting verbose justifications (thread safety, why subscriptions are bad)
- Removing the Architecture Summary section that repeated earlier diagrams

Retained the essential content:
- Problem statement and root cause
- Critical deferred setup constraint (key insight)
- Architecture diagrams (current vs proposed)
- Key interfaces (signatures only)
- Phased implementation plan

Prompt: Please carefully review docs/developer/plans/multi-session-architecture.md
- we need to hand this off to an implementer and need to ensure it is both
complete and concise. Are there unnecessary details that just add context
pollution and would be easier to figure out during implementation?
Fix the periodic callback bug and add infrastructure for multi-session support.

Phase 1: Fix periodic callback bug in DashboardBase
- Remove shared `_callback` attribute that caused only last tab to work
- Each session now gets its own independent periodic callback
- Register cleanup via pn.state.on_session_destroyed()

Phase 1.5: Implement SessionRegistry
- Track active browser sessions with heartbeat-based cleanup
- Defense-in-depth for when on_session_destroyed fails to fire
- Thread-safe with configurable stale timeout

Phase 2: Implement state stores
- WidgetStateStore: Key-value store with global versioning
- NotificationQueue: Cursor-based queue for per-session notifications
- Enables polling model instead of direct subscription callbacks

Phase 4: Implement PlotDataService
- Stores computed plot state with version tracking
- Enables each session to poll for updates independently

Related: Issue #609 - Multi-session support for plots

Prompt: Implement multi-session architecture per docs/developer/plans/multi-session-architecture.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Refactor Plotter to support the two-stage compute/present architecture:

- Add `compute()` method as Stage 1 (data -> HoloViews elements)
- Add `Presenter` protocol for Stage 2 (pipe -> DynamicMap)
- Add `DefaultPresenter` that wraps plotter's compute() in DynamicMap
- Add `create_presenter()` factory method to Plotter base class
- Keep `__call__` as backwards-compatible alias for `compute()`

This enables the multi-session pattern where:
- Shared orchestrator calls plotter.compute() to create shareable PlotState
- Each session creates its own Presenter, Pipe, and DynamicMap
- Session-bound components are created in correct session context

Related: Issue #609 - Multi-session support for plots

Prompt: Implement multi-session architecture per docs/developer/plans/multi-session-architecture.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement SessionUpdater and integrate with dashboard infrastructure:

- Add SessionUpdater class for per-session update management
  - Polls shared state stores (WidgetStateStore, PlotDataService)
  - Processes notification queue for per-session notifications
  - Handles widget and plot update handlers with error isolation
  - Provides batched UI updates via pn.io.hold()

- Integrate SessionUpdater with DashboardBase
  - Create SessionUpdater per session in start_periodic_updates()
  - Store session updaters keyed by session ID
  - Clean up SessionUpdater on session destruction

- Add shared state stores to DashboardServices
  - WidgetStateStore for cross-session widget state
  - PlotDataService for version-tracked plot data
  - NotificationQueue for per-session notifications

The infrastructure is now complete for gradually migrating widgets to
use the polling model instead of direct subscription callbacks.

Related: Issue #609 - Multi-session support for plots

Prompt: Implement multi-session architecture per docs/developer/plans/multi-session-architecture.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements Phase 5 (SessionUpdater integration) and Phase 2 (notification
migration) of the multi-session architecture plan.

Plot data flow:
- Extended PlotLayerState with plotter, initial_data, and shared_pipe fields
- PlotOrchestrator stores plot state in PlotDataService instead of creating
  session-bound DynamicMaps directly
- PlotGridTabs uses deferred setup pattern: callbacks only record pending
  setups, actual DynamicMap creation happens in session's periodic callback
- Added data forwarding from shared pipes to per-session pipes
- Added late subscriber support: new sessions scan PlotDataService for
  existing layers on initialization

Notification migration:
- JobOrchestrator pushes command notifications to NotificationQueue instead
  of calling widget callbacks directly
- Removed on_command_success/on_command_error from WidgetLifecycleCallbacks
- Removed direct pn.state.notifications calls from WorkflowStatusWidget
- SessionUpdater polls NotificationQueue and shows notifications in correct
  session context

SessionUpdater enhancements:
- Added register_custom_handler() for session-specific periodic processing
- Custom handlers run in correct session context during periodic_update()

This enables multiple browser tabs to:
- Each receive their own plot updates independently
- All see command notifications (start/stop/reset success/error)
- Join late and still see existing plots

Prompt: Implement the following plan: Multi-Session Architecture: Wire Up Infrastructure
Extract session-specific plot management from PlotGridTabs into a dedicated
SessionPlotManager class. This reduces PlotGridTabs complexity by ~40% and
provides a clean, testable abstraction for per-session components.

Key changes:
- Add SessionPlotManager to handle per-session Pipes, Presenters, DynamicMaps
- Add ROIPresenter class for proper Presenter pattern with ROI plotters
- Add PlottingController.create_plotter() for dependency injection
- Remove legacy path from PlotOrchestrator (PlotDataService now required)
- Use version-based updates instead of object identity checks

Update tests to reflect that LayerState.plot is now None (deferred to
per-session creation via SessionPlotManager).

Prompt: Please read docs/developer/plans/multi-session-architecture.md - the
implementer handed us the changeset. Tested this in the UI, and apart from
minor issues the key requirements are fulfilled. However, if we look at the
concrete implementation changes certain things are too complicated. Please
analyze and think carefully if this can be simplified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous commit broke plot updates because SessionPlotManager used
PlotDataService.get_updates_since() which only tracks initial setup,
not ongoing streaming data.

Restore the original data forwarding mechanism:
- Track shared_pipe references when setting up layers
- Forward data using object identity checks on shared_pipe.data
- This is how streaming updates propagate to each browser session

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensure compute() is never called from a session since it could modify
shared state (like autoscalers). Instead, compute() runs once in shared
context, and per-session Presenter.present() creates DynamicMaps that
render from pre-computed state.

Key changes:
- Add SlicerState dataclass to hold computed data and pre-computed clim
- SlicerPlotter.compute() returns SlicerState with global color limits
- SlicerPlotter.render_slice() renders per-session slices without
  modifying shared state
- Add SlicerPresenter for per-session kdims and DynamicMap creation
- DefaultPresenter is now simple pass-through (receives pre-computed
  HoloViews elements)
- Remove kdims property and initialize_from_data() from base Plotter
- Add on_data_update callback to data pipeline for triggering recompute
Replace the two-phase callback system (on_first_data/on_data_update) with
a single on_data callback. This removes unnecessary complexity from the
multi-session architecture:

- DataSubscriber: Single on_data callback, remove vestigial Pipe protocol
- StreamManager: Remove pipe_factory, simplify make_stream() signature
- PlottingController: setup_pipeline() uses single on_data callback
- PlotOrchestrator: Create plotter eagerly when workflow commits, not when
  data arrives. Remove the layer_plotter closure hack.
- DashboardServices: Remove pipe_factory parameter

Key behavioral changes:
- Plotter creation happens when workflow commits (eager), not on first data
- on_cell_updated notifications no longer fire on data arrival - sessions
  poll PlotDataService directly instead
- Source name filtering is handled by the data pipeline (on_data only
  fires when subscribed keys have data)

Update tests to reflect new semantics: eager plotter creation, changed
notification counts, and simplified interfaces.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, each browser session's periodic callback called
orchestrator.update(), but BackgroundMessageSource drains the entire
message queue on get_messages(). This meant only whichever session
called first would actually process messages, while others got empty
queues.

Now DashboardServices spawns a dedicated background thread that runs
orchestrator.update() every 0.5s, separate from per-session callbacks.
Each session's callback only polls for UI changes via SessionUpdater.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Custom handlers like _poll_for_plot_updates update plot pipes and were
running outside the pn.io.hold() block, causing staggered rendering.
The original code only entered the hold block when there were polled
changes, but custom handlers need to run every update cycle regardless.

Now all UI updates (polled changes and custom handler work) are batched
together in a single hold block.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The orchestrator update loop previously waited a fixed time AFTER each
update, causing drift when processing took variable time. This led to
irregular perceived update intervals (e.g., 1.5s, 0.3s) due to poor
alignment with session polling.

Now uses fixed-rate scheduling that accounts for processing time,
maintaining consistent 200ms intervals. If processing takes longer than
the interval, the schedule resets to avoid permanent backlog.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of always sleeping for a fixed interval, only sleep when
update() completes quickly (< 10ms), indicating no real work was done.
When there's actual data to process, loop immediately to drain the
queue. This avoids busy-waiting when idle while being responsive when
data is flowing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move session cleanup from per-session periodic callbacks to the shared
background thread that drives the orchestrator. This fixes two issues:

- Redundancy: cleanup_stale_sessions was called N times per cycle (once
  per browser tab) instead of once
- No cleanup after last session: when all sessions closed cleanly, the
  background thread continued but cleanup stopped

Changes:
- SessionRegistry now stores SessionUpdater instances and handles their
  cleanup automatically on unregister or stale detection
- DashboardServices owns SessionRegistry and calls cleanup from _update_loop
- DashboardBase simplified: removed _session_updaters dict, _on_session_cleanup
  callback, and duplicate cleanup logic

Prompt: I wonder why we call cleanup_stale_sessions from every session's
periodic callback. This seems (a) redundant and (b) will stop cleanup once
the last session is destroyed. Shouldn't it be called from the background
thread that also drives the orchestrator?

Follow-up: yes, move it there; related, I wonder why we store
self._session_updaters and have both _cleanup_session and _on_session_cleanup;
does this belong more into SessionRegistry?

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SessionUpdater now registers itself with the SessionRegistry in __init__,
removing the need for a separate register() call. This makes the setup
flow clearer: creating a SessionUpdater = starting a session.

Prompt: The session setup code in start_periodic_updates feels odd. We first
pass the registry into the updater and then register the updater in the
registry. Should _get_session_id move into the registry and the setup into
a factory method on the registry that can auto-register? Or can you think
of a way to make this cleaner and more obvious?

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Flip the lifecycle order so SessionUpdater is created before widgets.
This allows widgets to receive session_updater in their constructor
and register handlers directly, eliminating post-hoc wiring.

Changes:
- DashboardBase: create_main_content now receives session_updater
- DashboardBase: extracted _create_session_updater() and
  _start_periodic_callback() from the old start_periodic_updates()
- PlotGridTabs: now requires plot_data_service and session_updater
  in constructor, registers handler in __init__
- Removed PlotGridTabs.set_session_services() - no longer needed
- Removed hasattr check for _plot_grid_tabs in base class

This design makes dependencies explicit: widgets declare what they need
in their constructor. Adding new session-aware widgets is now simple -
just add session_updater parameter, no base class changes needed.

Prompt: If we assume that there may be more widgets that need to hook
into the session updater in the future, what would you recommend?
Think carefully, we want to get this right as it is central to our
architecture.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WidgetStateStore was planned for polling-based widget state synchronization
but was never wired up in production code. The existing WidgetLifecycleCallbacks
mechanism works correctly for widget synchronization, so we keep the simpler
callback approach.

Removed:
- WidgetStateStore, StateKey, and VersionedState from state_stores.py
- widget_state_store creation from DashboardServices
- Widget state polling from SessionUpdater
- Related tests

The dual mechanism (NotificationQueue for toasts, callbacks for widgets) is
intentional: toasts must be triggered from session context via polling, while
widget callbacks work correctly with Panel's session handling.

---
Prompt: Remove unused WidgetStateStore and update review notes
Separate notification queue into its own file for better organization:
- Create notification_queue.py with NotificationQueue, NotificationEvent,
  NotificationType
- Update state_stores.py to only contain PlotDataService and PlotLayerState
- Create notification_queue_test.py with notification tests
- Update imports in all consuming modules

---
Prompt: Move notification queue to its own file (same for tests)
Simplify the plot state architecture by:
- Merging LayerState into PlotLayerState (single source of truth)
- Routing static plots through PlotDataService like streaming plots
- Simplifying CellUpdatedCallback to just signal "config changed"

Key changes:

PlotLayerState (state_stores.py):
- Add error/stopped fields for lifecycle state
- Add helper methods: create_entry(), set_error(), set_stopped()

PlotOrchestrator:
- Remove LayerState class entirely
- Rename get_cell_state() to get_cell() (returns just config)
- Remove _compose_cell_plot() (widgets compose via session DMaps)
- Store static plots in PlotDataService via StaticPresenter

CellUpdatedCallback:
- Remove layer_states and plot parameters
- Callback now just signals config changed, subscribers query
  PlotDataService for runtime state

Static plots:
- Add StaticPresenter class that returns hv.Element directly
- Add create_presenter() method to StaticPlotter base class

PlotGridTabs:
- Add _get_layer_states() helper to query PlotDataService
- Update all methods to get state from PlotDataService
- Update type hints to include hv.Element for static plots

SessionPlotManager:
- Update type hints to allow hv.Element return from get_dmap()

Prompt: Implement the following plan: <plan for unifying LayerState and PlotDataService>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add check for state.state being None before attempting to set up a layer.
This handles the case where an error occurs after data was previously
received - the plotter is preserved but state is cleared. Without this
check, the session would try to create a Pipe with None data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously there were two different LayerId types:
- plot_data_service.LayerId = NewType('LayerId', str)
- plot_orchestrator.LayerId = NewType('LayerId', UUID)

This required importing as StateLayerId and constant boilerplate
conversions: StateLayerId(str(layer_id))

Now there is a single LayerId type in plot_data_service based on UUID,
which is imported by plot_orchestrator and other modules.

Prompt: Find out why we have two different LayerId types, one often imported as StateLayerId.
Follow-up: unify them to LayerId (UUID) and commit
Extract SlicerState, SlicerPresenter, and SlicerPlotter from plots.py
into a new slicer.py module. Update imports in plotting.py and tests.

Prompt: Please move SlicerPresenter and SlicerPlotter to their own file.
- Remove @AbstractMethod from Plotter.plot(), provide default NotImplementedError
- Remove ABC base class from Plotter (no longer has abstract methods)
- Move render_slice, _slice_data, _flatten_outer_dims from SlicerPlotter to SlicerPresenter
- SlicerPresenter now takes base_opts dict instead of plotter reference
- Pre-prepare 3D data in compute() (dtype conversion + log masking)
- Update tests to use presenter for render_slice testing

This improves separation of concerns:
- SlicerPlotter (shared): config setup + compute() for data preparation
- SlicerPresenter (per-session): render_slice + kdims + DynamicMap

Prompt: Consider slicer.py and the split between the plotter (shared compute
stage) and presenter (per-session second stage). We need to think about
whether more work that is performed in render_slice (called per-session)
could be performed in the shared compute phase.
SlicerPresenter was only receiving base_opts but not sizing_opts,
which caused aspect ratio and responsive settings to not be applied.

Now passes both base_opts and sizing_opts to the presenter.
When changing plot settings (e.g., scale from LINEAR to LOG), the plot
would freeze in its previous state and stop updating. Sometimes switching
windows back and forth would "unfreeze" it, and there were cases where
the state was out of sync.

Root cause: SessionPlotManager caches DynamicMaps/Pipes/Presenters keyed
by layer_id. When config changed, the same layer_id was reused, so the
cached old presenter (with old params) continued to be used.

Solution: Treat config changes as remove+add with a NEW layer_id. When
update_layer_config() is called:
1. Old layer_id is fully cleaned up (removed from PlotDataService)
2. A fresh layer_id is generated for the updated layer
3. Session caches for old layer_id become orphaned and are cleaned up
4. New layer_id has no cache, so fresh components are created naturally

This approach is simpler than tracking config versions or plotter identity
because cache invalidation happens automatically via key expiration.

DataService triggers subscribers immediately with existing data on
registration, so the plot updates immediately without waiting for new
stream data.

Prompt: Please investigate the following bug: When changing the settings
of an existing plot (which should recreate it with new config), the plot
just freezes in previous state and stops updating.

Follow-up: Simplified from plotter identity tracking to new layer_id
approach per discussion that treating config changes as remove+add is
cleaner and more fool-proof.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SimonHeybrock and others added 13 commits January 28, 2026 09:13
Take snapshots of job_statuses dictionary values before iterating to
prevent "dictionary changed size during iteration" errors when
background Kafka threads modify the dictionary concurrently.

Prompt: I got a one-off exception in the dashboard logs: RuntimeError: dictionary changed size during iteration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When using 'nearest' mode in _make_lookup, the code tried to convert
the axis dimension coordinate to float64, but scipp's astype('float64')
doesn't support datetime64 dtypes. This caused sporadic exceptions when
the axis dimension was 'time' with datetime64 values.

Skip the float64 conversion for datetime64 coordinates since they are
accepted as-is by scipp's lookup in nearest mode. The conversion was
only needed for int coords (rejected since scipp v25.11.0).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restore pn.state.execute() deferral in _poll_for_plot_updates() to prevent
race condition between Bokeh model updates and widget insertion.

The issue occurred during tab switching with workflow stop/start cycles:
1. update_pipes() calls pipe.send() triggering DynamicMap/Bokeh model updates
2. Widget insertion happened immediately, racing with model processing
3. Panel tried to access models that were being removed, causing KeyError

Deferring insertion to the next event loop iteration gives Bokeh time to
process pending updates before widget replacement.

This reverts the problematic part of commit 18fd8a6.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a workflow restarts, PlotOrchestrator creates a new DataSubscriber
but the old one remained in DataService._subscribers, causing a memory
leak. Old subscribers wouldn't receive data (keys don't match new
job_number) but accumulated over time.

Changes:
- Add DataService.unregister_subscriber() to remove subscribers
- Update StreamManager.make_stream() to return the subscriber
- Update PlottingController.setup_pipeline() to return the subscriber
- Track data subscriptions per layer in PlotOrchestrator._data_subscriptions
- Clean up old subscription in _on_all_jobs_ready() before creating new one
- Clean up subscription in _unsubscribe_layer() when layer is removed

Prompt: 9451e56 should have fixed job-transition issues, but I am still
seeing issues in some cases. Can you investigate and think if the mechanism
has any other holes?

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Converting WeakSet to list before iteration prevents RuntimeError
that can occur when the garbage collector removes dead references
during iteration.

Prompt: fix the WeakSet issue
This refactors the plot layer lifecycle management from implicit state
(derived from multiple fields) to an explicit state machine with
well-defined states and transitions:

- WAITING_FOR_JOB: Layer configured, workflow not running
- WAITING_FOR_DATA: Job started, plotter exists, no data yet
- READY: Has data, can render plot
- STOPPED: Workflow stopped (may have stale data)
- ERROR: Error occurred

Key changes:

- Add LayerState enum and LayerStateMachine class with validated
  transitions: job_started(), data_arrived(), job_stopped(),
  error_occurred()

- Add version counter that increments on every state transition,
  enabling polling-based change detection

- Update PlotOrchestrator to use state machine methods instead of
  set_plotter/set_error/set_stopped

- Remove callback-based _notify_cell_updated calls for state changes
  (stopped, error). UI now detects state changes via polling.
  Keep callbacks only for structural changes (new cells).

- Update PlotGridTabs to detect version changes in polling loop,
  replacing the callback-based _on_cell_updated for state changes

- Update placeholder content to use match statement on state enum
  for clearer status display logic

- Update SessionPlotManager._setup_layer to use has_displayable_plot()
  which checks for READY or STOPPED state with cached data

This fixes the bug where "Workflow ended" briefly appeared during
workflow restart, because job_started() now properly increments
version and polling detects the state change.

---

Prompt: Implement the following plan: [Explicit State Machine for Plot Layer Lifecycle]
Simplify the state machine implementation by removing the backward-compatible
PlotLayerState facade class. PlotDataService now stores LayerStateMachine
instances directly.

Changes:
- Remove PlotLayerState dataclass facade
- Remove unused ensure_layer method
- Remove all logging from plot_data_service.py
- Update tests to use new API (job_started, error_occurred, job_stopped)
- Use .error_message instead of .error, check state enum instead of .stopped

Prompt: "Keeping PlotLayerState as a facade is not what we want, no backward
compat needed!" followed by "ensure_layer looks unused" and "Please also
remove all the logging from plot_data_service.py, we will add this on a
per-need basis."

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Forgot to include in previous commit - updates imports and type annotations
from PlotLayerState to LayerStateMachine, and .error to .error_message.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove redundant DMap transition check in _poll_for_plot_updates
  (version tracking now handles all state change detection)
- Add logging warnings for invalid state transitions in LayerStateMachine
- Make READY case in placeholder content defensive with warning log

Original prompt: Help me understand the PlotOrchestrator/PlotGridTabs mechanism
and critical review of state machine implementation (+400 lines question).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The on_data() callback in plot_orchestrator is invoked on every data
update, calling data_arrived() each time. However, the state machine
only needs to transition once from WAITING_FOR_DATA to READY. After
that, subsequent data arrivals are handled by plotter.compute() which
updates the plot state directly.

Make data_arrived() a silent no-op when already in READY state, since
this is expected behavior for continuous data streams.

Prompt: The recent commits refactor the plot update mechanism to a state
machine. I get warnings in the logs (even though the plot updates just
fine): Invalid transition: data_arrived() called in state READY
(expected WAITING_FOR_DATA)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When adding a new plot to an already-running workflow, the plot would
appear briefly and then be recreated, causing a visible flicker.

Root cause: _on_cell_updated creates the widget but doesn't record the
layer versions in _last_seen_versions. On the next poll cycle (~100ms),
_poll_for_plot_updates sees last_version=None for the new layers and
triggers a redundant rebuild.

Fix: Record current versions for all layers in _on_cell_updated after
creating the widget, preventing the polling loop from triggering
duplicate rebuilds.

Prompt: I see another issue: When adding a new plot where the job is
already streaming data, the plot appears for a fraction of a second,
and is then instantly recreated a second time (afterwards it is stable).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two issues prevented static overlays from displaying:

1. Missing UI notification: _subscribe_layer_to_workflow returned early
   for static layers without calling _notify_cell_updated, so sessions
   were never informed to display the overlay.

2. Missing cached state: create_static_plot() returned the element but
   didn't store it. has_displayable_plot() returned False because
   _cached_state was None, so SessionPlotManager skipped setup.

Fixes:
- Add _notify_cell_updated call after creating static layer plot
- Capture create_static_plot() return value and call _set_cached_state()

Prompt: Next problem: Static overlays are not showing up any more.
Follow-up: This is not working. Note that the toolbar (header) for the
static overlay appears just fine, but it is not shown in the composed
plot.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Layers were previously added to the plot grid config before PlotDataService
knew about them, requiring `state is None` checks throughout widget code.

Now layers are registered in PlotDataService before widgets are notified:
- Add layer_added() for dynamic layers (WAITING_FOR_JOB state)
- Add static_layer_ready() for static overlays (READY state)
- Call these methods in _subscribe_layer before _notify_cell_updated()

This means WAITING_FOR_JOB state is now actually used (was dead code before),
and widget code can rely on state always being present:
- Type signatures simplified from LayerStateMachine | None to LayerStateMachine
- _get_layer_states() raises RuntimeError if state missing (catches bugs)
- Removed state is None checks from _create_layer_toolbars and
  _create_placeholder_content

Also made plot_data_service required in PlotOrchestrator (was optional with
None default, but docstring already said "required for dashboard to function").

---
Prompt: Help me think through _create_layer_toolbars. Why do we need to handle
`state is None` and why does that correspond to `Waiting for data...`?

Follow-up: User agreed that STATIC==READY makes sense and that registering
layers before notifying widgets is cleaner.
Base automatically changed from improved-logging to main January 29, 2026 10:04
SimonHeybrock and others added 12 commits January 29, 2026 10:32
Clarify that version is needed to detect plotter replacements when
job_started() is called while already in WAITING_FOR_DATA state -
the state doesn't change but UI still needs to rebuild.

Prompt: "Do we need version in LayerStateMachine or can PlotGridTabs store
latest_state instead of latest_version? What about layer reconfig, does it
remove and create a new layer? Are there other 'self' transitions where we
still want to increment the version to effect a widget recreation?"
Follow-up: "Great analysis, can you add some clarifying info in the docstring
(or code comment)?"
Resolved conflicts in:
- plot_orchestrator.py: Keep PlotDataService-based state management,
  remove _compose_cell_plot (composition now happens in widget layer)
- plot_grid_tabs.py: Combine imports (PRIMARY + PlotDataService types)
- plot_grid_tabs_test.py: Remove redundant fixtures, use data_service
PlotterT was defined but never used anywhere in the codebase.

Prompt: Why do we define PlotterT in plots.py? Can we not use Plotter directly?

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the separate SessionPlotManager class with a simpler SessionLayer
dataclass managed directly by PlotGridTabs. This eliminates parallel
collections that required synchronization logic spread across two classes.

Before:
- PlotGridTabs._last_seen_versions: dict[LayerId, int]
- SessionPlotManager._presenters: dict[LayerId, Presenter]
- SessionPlotManager._pipes: dict[LayerId, Pipe]
- SessionPlotManager._dmaps: dict[LayerId, DynamicMap]

After:
- PlotGridTabs._session_layers: dict[LayerId, SessionLayer]
  (each SessionLayer holds presenter, pipe, dmap, last_seen_version)

Benefits:
- Single collection instead of parallel collections
- Version tracking colocated with session components
- Orphan cleanup and plotter replacement detection in one place
- Each SessionLayer is a self-contained unit that knows how to update itself

Original prompt: Help me think through the PlotDataService and
SessionPlotManager. I am trying to understand if there is unnecessary
complexity in trying to keep them in sync. Would a lot of the management
disappear if each session could simply hook into PlotDataService, with
all the per-session objects managed by the service?
Simplify _poll_for_plot_updates() by merging the two iteration phases into
a single pass over orchestrator layers. Previously:
- Phase 1: Iterate _session_layers to push data and clean up invalids
- Phase 2: Iterate orchestrator layers to detect version changes

Now a single pass handles all three concerns:
- Push data updates to existing session pipes
- Clean up invalid session layers (plotter replaced)
- Detect version changes requiring cell rebuilds

Orphan cleanup (layers removed from orchestrator) happens after the main
loop by checking which layer_ids were seen.

Also remove PlotDataService.static_layer_ready() which was dead code:
_create_static_layer_plot() already registers the layer via job_started()
and data_arrived() (or error_occurred() on failure), so the subsequent
static_layer_ready() call was always a no-op.

Original prompt: Does this refactor the door for further simplications?
Anything in the mechanism that can now be done simpler?
The bug: Commit 57e0d61 moved version tracking into SessionLayer, but
SessionLayer was only created when displayable data was available. For
layers waiting for data, no SessionLayer existed, so `last_seen_version`
was always None, triggering rebuilds on every poll cycle.

The fix: Create SessionLayer eagerly for version tracking, with optional
render components that are added lazily when data becomes available.

Refactor SessionLayer to separate concerns:
- SessionComponents: groups presenter/pipe/dmap (always created together)
- SessionLayer: holds layer_id, last_seen_version, and optional components

This keeps the "single collection" benefit of the original refactor while
properly handling layers that don't have displayable data yet.

Original prompt: I think 57e0d61 broke something: The widgets are now
constantly beeing recreating (in every update?).
The comment claimed circular import issues, but there were none:
plots.py doesn't import session_layer.py or plot_data_service.py.
The same file already used Plotter in type annotations elsewhere.

Clean up by importing Plotter/Presenter directly and using an
assertion to help the type checker narrow the Optional type.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add StaticPlotter.compute() method that accepts (and ignores) a data dict,
enabling interface compatibility with dynamic plotters. This keeps caching
logic encapsulated within the plotter instead of requiring external calls
to private methods.

Extract two helper methods in PlotOrchestrator:
- _create_and_register_plotter: creates plotter, calls job_started(), handles errors
- _run_compute: calls plotter.compute(), calls data_arrived(), handles errors

Both static and dynamic paths now use these shared helpers, eliminating
code duplication and the need for _create_static_layer_plot. Also removes
unused cell_id and grid_id parameters from _on_all_jobs_ready.

Prompt: Help me understand why we have `create_static_plot` instead of simply
implementing `StaticPlotter.compute`? Wouldn't this avoid the manual caching
and private method call in _create_static_layer_plot?

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The version-based change detection mechanism already ensures rebuilds
happen when plotters change. The explicit is_valid_for check was
redundant because:

- job_started(plotter) always increments version (invariant guaranteed)
- Version change detection triggers cell rebuilds
- ensure_components handles plotter validity internally during rebuild

Also removes the now-unused SessionLayer.is_valid_for() method.

Adds comprehensive tests for the version invariant in LayerStateMachine
and the version-based rebuild mechanism in PlotGridTabs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Four files introduced on this branch were using logging.getLogger()
instead of structlog.get_logger(). This fixes them to be consistent
with the rest of the codebase after the structlog migration.

Files fixed:
- dashboard/session_registry.py
- dashboard/session_updater.py
- dashboard/plot_data_service.py
- dashboard/widgets/plot_grid_tabs.py

Prompt: commit these fixes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@SimonHeybrock SimonHeybrock marked this pull request as ready for review February 2, 2026 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Duplicate notifications with multiple sessions and stale sessions Multi-session support for plots

1 participant