What is Jupyter, and why do you care? After all, Jupyter has never become a buzzword like data science, artificial intelligence, or Web 2.0. Unlike those big abstractions, Jupyter is very concrete: it’s an open-source project, a piece of software that does specific things.
When you combine Jupyter with containers and a source management system like GitHub, you get a platform for collaboration: on coding, on data analysis, on visualization, on anything that can be done in code.
The Jupyter architecture
- The notebook front end, which allows you to edit and run notebooks. It’s a JavaScript application that’s delivered to your browser, like any other web application. The front end holds the notebook document you’re editing and sends it to the Jupyter server, which manages your notebooks in the filesystem.
- The Jupyter server, which is either a relatively simple application that runs on your laptop, or a multi-user server. The Jupyter project’s JupyterHub is the most widely used multi-user server for Jupyter.
- The kernel protocol, which allows the server to offload the task of running code to a language-specific kernel. Jupyter ships with kernels for Python 2 and Python 3, but kernels for many other languages are available. (The sketch following this list shows the protocol in action.)
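To make the kernel protocol concrete, here is a minimal sketch that starts a Python kernel and sends it code to execute, just as the front end and server do on your behalf. It uses the jupyter_client package, which ships with Jupyter:

```python
# A minimal sketch of the kernel protocol in action, using the
# jupyter_client package that ships with Jupyter.
from jupyter_client import KernelManager

km = KernelManager(kernel_name="python3")
km.start_kernel()                  # launch the language-specific kernel process

kc = km.client()
kc.start_channels()                # open the messaging channels the protocol uses
kc.wait_for_ready()                # block until the kernel responds

kc.execute("print(6 * 7)")         # send an execute_request message
for _ in range(4):                 # read the replies on the IOPub channel
    msg = kc.get_iopub_msg(timeout=5)
    print(msg["msg_type"])         # e.g., 'status', 'execute_input', 'stream'

kc.stop_channels()
km.shutdown_kernel()
```

Any language that implements this message protocol can plug in as a kernel; the server doesn’t care what’s on the other end.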
This architecture, though simple, is very flexible. You can substitute your own front end, as the nteract project has done with its desktop notebook application; you can build a front end that implements real-time dashboards; you can use the Jupyter protocol to implement support for other languages; you can implement custom servers to create new media types. O’Reilly Media’s Orioles combine Jupyter Notebooks with a parallel streaming video narrative, synchronized to the notebook.
The Jupyter workflow
It’s most productive to use Jupyter together with Git and Docker, two tools that many researchers aren’t familiar with.
Git is a version control system: it’s used for tracking different versions of software and recording the differences between versions. It allows you to roll back to an earlier version; it also allows code sharing, and it lets multiple people work on the same codebase and resolve conflicts between their changes. Jupyter Notebooks are just text files (JSON documents, under the hood): to Git, they look like any other source code, and Git has no problem tracking them.
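You can see this for yourself with nbformat, the library Jupyter uses to read and write notebooks. A sketch:

```python
# A sketch showing that a notebook file is ordinary JSON text, which is
# exactly the kind of file Git is good at tracking.
import nbformat

nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_code_cell("print('hello, Git')"))
nbformat.write(nb, "demo.ipynb")

# The file on disk is plain text; `git diff` shows changes to cells
# as readable JSON edits.
with open("demo.ipynb") as f:
    print(f.read())
```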
Docker is a tool for automating application deployment. It allows you to “shrinkwrap” everything that’s needed for an application to run: Jupyter itself, the notebook, all the libraries, and any other tools (data, etc.) needed to run the application, even a stripped-down operating system (typically Linux). One of the most painful parts of sharing code in any widely used programming language is resolving conflicts between libraries, the language runtime, the operating system, etc. If you’ve tried to install someone else’s software, you’ve probably experienced version hell: their project requires database X, but X needs library Y version 1.9, and you have version 1.8 installed. When you try to get 1.9, you find that it won’t build. To build version 1.9 of library Y, you need version 3.4 of library Z, but Z has never been ported to your operating system. And so on. Docker eliminates this problem: instead of delivering your code by itself, you deliver the entire package, everything that’s needed in the runtime environment. You start the container, the container starts its own operating system, which starts Jupyter, and everything just works.
I won’t describe how to use Git and Docker in detail; they’re tools that could be simpler, and a number of organizations (including O’Reilly) are working on tools to simplify integrating Jupyter with Git and Docker. With Git and Docker in the picture, the workflow looks like this:
- Use Git locally (or use an external service, like GitHub): whenever you reach a significant point in your work, commit the results to the Git repository. You’ll now be able to revert to the current version should you need to.
- Keep a Dockerfile in your repository, along with your notebooks. Use the Dockerfile to record everything you need to run the notebooks: libraries, data, utilities. There are pre-built Docker images that contain most of what you need, for a number of common environments, so, in practice, you don’t have to modify the Dockerfile much.
- Run the Jupyter server inside the Docker container. That keeps everything clean and isolated.
- You can push your Docker image to a registry, such as DockerHub. At that point, other users can pull your image, start a container identical to yours, and run your code without worrying about version hell.
The Jupyter workflow requires some discipline, but it’s worth it. The Jupyter project maintains a collection of Dockerfiles for many common configurations: Python with numeric libraries for data analysis, Python and Scala with Apache Spark (a general-purpose engine for distributed computation), Python with R, and many others. Using these containers eliminates installation pain almost completely: all you need to do is install Docker, and you can then download, build, and start a container with a single command, as in the sketch below.
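For example, with Docker installed, the shell one-liner `docker run -p 8888:8888 jupyter/scipy-notebook` pulls and starts one of the project’s pre-built images. Here’s the same step as a sketch using the Docker SDK for Python (the separately installed `docker` package):

```python
# A sketch using the Docker SDK for Python (pip install docker) to pull
# and start one of the Jupyter project's pre-built images in one step.
import docker

client = docker.from_env()
container = client.containers.run(
    "jupyter/scipy-notebook",      # pre-built image: Python plus numeric libraries
    ports={"8888/tcp": 8888},      # publish the notebook server's port
    detach=True,                   # run in the background
)
# After a moment, the startup log prints the tokenized login URL.
print(container.logs().decode())
```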
Jupyter at work and school
IBM recently published a case study describing work they did for the Executive Transportation Group, a black car service operating in New York City. Data analysts used Jupyter to analyze data about rides and drivers, reaching Apache Spark’s distributed computing capabilities through the Toree kernel; Toree and Spark allowed them to spread the computation across many machines and process tens of millions of geographic lookups in a timely way.
To create the ETG project, IBM contributed to, and took advantage of, several extensions to Jupyter, a give-and-take relationship that is only possible with an open source project. The team used Jupyter interactive (“declarative”) widgets to build dashboards that allowed them to communicate results to ETG staff. Interactive widgets let developers provide the kinds of controls you’d expect from graphical applications: sliders, buttons, and other web components. The dashboard extension makes it possible to arrange those components in complex layouts, rather than the linear top-to-bottom layout that notebooks give you by default.
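As a minimal illustration (a toy example, not ETG’s actual dashboard code, which hasn’t been published), the ipywidgets package can turn an ordinary function into a live control with a decorator:

```python
# A minimal ipywidgets sketch: the @interact decorator turns a function
# into an interactive control rendered below a notebook cell.
import ipywidgets as widgets
from ipywidgets import interact

@interact(fare=widgets.FloatSlider(min=5.0, max=100.0, value=25.0))
def suggested_tip(fare):
    # re-runs every time the slider moves
    print(f"Suggested tip on a ${fare:.2f} ride: ${fare * 0.20:.2f}")
```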
The IBM and ETG teams were able to iterate rapidly as they refined their analytic tools: they could deploy the dashboard as a web application, collect feedback and questions from their users, modify the application, and iterate. Jupyter enabled an agile process for analyzing the data and building the tools that ETG needed.
Lorena Barba, professor of mechanical and aerospace engineering at George Washington University, is a leader in using Jupyter in teaching. She calls Jupyter Notebooks “computable content” and “a killer app for STEM education,” because notebooks make it possible to share material directly with students. It’s not just written on the blackboard; it’s shared in a way that lets students interact with it directly, combined with text, links, images, and videos. You don’t learn to code through lectures; you learn by interacting and experimenting.
The new Foundations of Data Science course at UC Berkeley, open to all undergraduates, demonstrates this approach at scale. Thousands of students receive assignments, access to large data sets, and instructions for completing the assignments as notebooks. All the code runs on JupyterHub in the cloud, eliminating the problem of software installation. According to the instructors, the course “wouldn’t be possible without Jupyter Notebooks, which enable browser-based computation, avoiding the need for students to install software, transfer files, or update libraries.” Extensions to Jupyter, which will be incorporated into future releases, support real-time interaction between students and teachers: questions, answers, help on assignments, all in the context of the actual code the student is writing.
Extending Jupyter
We’ve just talked about the widget and dashboard extensions. There are also widgets for more advanced tasks, like creating maps based on OpenStreetMap and doing interactive data visualization in 2D and 3D. There’s also an extension that bridges Jupyter and d3.js, a powerful JavaScript library for building data-driven web documents.
The Jupyter ecosystem also includes tools for publishing your documents in different ways. For example, nbviewer is a simple tool that allows non-programmers to view Jupyter Notebooks. It doesn’t run the code or allow modifications; it just renders the “finished product” as a web page. Nbviewer can be installed locally; there is also a public nbviewer service, which can render any notebook that’s available online. All you need is the URL.
Nbviewer is based on nbconvert, which converts notebooks into many different static formats, including HTML, LaTeX, PDF, scripts (just the code, as an executable script, without the rest of the notebook), and slides.
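Nbconvert is usually driven from the command line, but it’s also a library. A minimal sketch of converting a notebook to a standalone web page (assuming a notebook named demo.ipynb in the current directory):

```python
# A sketch of nbconvert's library API: render a notebook as a
# standalone HTML page.
from nbconvert import HTMLExporter

exporter = HTMLExporter()
body, resources = exporter.from_filename("demo.ipynb")

with open("demo.html", "w") as f:
    f.write(body)   # the static "finished product," like nbviewer shows
```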
While there’s no single source listing all of Jupyter’s extensions, widgets, and tools, there is a lively ecosystem of developers working on building features for the Jupyter platform.
JupyterLab and the future
JupyterLab is the next important change in Jupyter’s universe. The JupyterLab computational environment rethinks Jupyter as an integrated development environment (IDE) for working with software.
Much of what’s in JupyterLab is already built into Jupyter; the JupyterLab project is really about taking features that are already baked in and exposing them so they can be used more flexibly. There’s a file manager; a graphical console; a terminal window for monitoring the system; an advanced text editor; and, of course, an integrated help system. What’s new is that these features are exposed in new ways: it’s easier to build dashboards, and easier to access the tooling needed to create and debug more complex applications.
JupyterLab isn’t ready for general usage yet, though if you’re adventurous, you can find it on GitHub. We hope there will be a public beta by the time of JupyterCon.
The Zero to JupyterHub project makes it much easier to run JupyterHub in the cloud: specifically, on Google Compute Engine and Microsoft Azure (with more to come). Running JupyterHub in the cloud means you can make notebooks accessible to a very broad audience without worrying about computing resources. Zero to JupyterHub uses the Kubernetes, Helm, and Docker projects to manage cloud services and to provide standard, robust computing environments.
The Jupyter project is working toward real-time collaboration in notebooks: allowing multiple users to edit a notebook simultaneously. We’re used to dynamic collaboration on Google Docs and other online platforms; why not Jupyter? There are extensions that allow notebooks to be hosted on Google Drive; we expect to see collaboration baked directly into JupyterHub, so that it’s available anywhere in workgroup and enterprise deployments.
Jupyter has become a standard for scientific research and data analysis. It packages computation and argument together, letting you build “computational narratives”; it allows you to publish your narratives in many formats, from live online notebooks to slide decks; it supports many popular programming languages; and it simplifies the problem of distributing working software to teammates and associates. There are many tools, ranging from traditional IDEs to analytics platforms, that solve one or two of these problems. Jupyter is the only one that solves them all. To succeed in digital transformation, businesses need to adopt tools that have been proven: tools that enable collaboration, sharing, and rapid deployment. That’s what Jupyter is about.