Alpine Data

Designing the Jupyter Notebook Integration

Integration of Jupyter Notebooks into Alpine Chorus, making the popular interactive documents first-class citizens in the data science platform.

Alpine Chorus and Jupyter

About Data Science Notebooks

Interactive notebooks are a fascinating tool for anyone who works and learns by empirical inquiry.

An example of Literate Programming, notebooks allow a person to write and develop in a process that is like having a conversation with the subject matter, iterating in the order of their thoughts, and simultaneously generating a record of their process.

This is akin to stream-of-consciousness thinking and resembles the creative process itself.

For the data scientist, data analyst, or anyone working with data — big and small — interactive notebooks can be used for discovery, feature development, transformation, visualization, and model running. They are equally well-suited to reusable procedures and shared code routines, such as a collection of data cleaning functions that are used repeatedly by a team.

Their popularity in data science made them a frequent user request for support in Chorus.

Editing a notebook
Editing a sample notebook in Alpine Chorus.

Designing the Integration With Chorus

All aspects of the integration were considered from a user-experience-centric point of view when incorporating notebooks into Alpine Chorus.

At the system level, the requirement was defined that there must be isolated sandboxing and authentication/authorization for individual users, so that notebooks would maintain existing access restrictions and data security.

For the visible aspects of the integration, new Chorus abilities were added to make notebooks equivalent to the existing data science collaborative tools:

  • Create a new notebook
  • Import (upload) an existing notebook
  • Copy a notebook to another project
  • Comment and tag a notebook
  • Search by name or content

Once a notebook was opened, an important design requirement was that the general user experience of composing and editing a notebook should not change, so that using notebooks in Alpine Chorus would be familiar to people who have already used stand-alone notebooks, and people new to the tool could learn using the existing community resources (documentation, example notebooks, StackExchange, etc.). Editing, viewing, inserting, and running functions were left unchanged.

Changes were made, nevertheless, to improve the user experience and to benefit the user. Features were removed, and features were added.

Using a notebook within the Chorus platform, rather than running it locally on the data scientist’s own computer (as is the most common case), means that managing kernels manually is no longer relevant. So that function was removed. In addition, “checkpoints” were replaced with the existing Chorus file versioning behavior for a consistent experience.

Add Data to a Notebook

In Chorus, data sources (database or Hadoop) are added to the application as the source locations for data used when creating analytics. These sources can be shared by everyone, or limited to the specific permissions of the user. Once the data source is made available, users can then browse the source and identify specific data collections — called datasets — to be attached to a project workspace, the team environment where the collaboration and data science work happens. Workspaces are also where notebooks exist.

To make a simple user experience, a new feature was designed that lets a user easily add the workspace’s data sets to the current notebook. The existing notebook menu was augmented with a new “Data” section, and the “Import Dataset” action was made available in this menu.

Add data to notebook
Add Data Dialog: Easily add shared data to a notebook using the familiar Chorus behaviors.

This provides the means to select from the datasets already available to the current workspace. Choose a dataset, and Chorus automatically generates the data frame in the notebook.

Notebooks and Recurring Jobs

Like software development, an interactive notebook goes through a lifecycle. In its finished state, the notebook might be a data transformation process or an analytic model for business questions. If that model needs to be executed repeatedly, the regular way of running a notebook is burdensome and unwieldy.

To solve this problem, the Chorus job functionality was extended to include notebooks. A job, an existing feature in the platform, lets a user create a sequence of tasks and run them either on-demand or on a schedule. Now, a notebook could be simply be included as another task.

Job Results
Job Results: A notebook has been successfully run as a scheduled job.