06-NBconvert-Doc-Draft
How to Use NBConvert¶
NBconvert migration¶
NBconvert has now been merged into IPython itself. You will need IPython 1.0 or above to have this works (asuuming the API have not changed)
Intro¶
In this post I will introduce you to the programatic API of nbconvert to show you how to use it in various context.
For this I will use one of @jakevdp great blog post. I've explicitely chosen a post with no javascript tricks as Jake seem to be found of right now, for the reason that the becommings of embeding javascript in nbviewer, which is based on nbconvert is not fully decided yet.
This will not focus on using the command line tool to convert file. The attentive reader will point-out that no data are read from, or written to disk during the conversion process. Indeed, nbconvert as been though as much as possible to avoid IO operation and work as well in a database, or web-based environement.
Quick overview¶
The main principle of nbconvert is to instanciate a Exporter
that controle
a pipeline through which each notebook you want to export with go through.
Let's start by importing what we need from the API, and download @jakevdp's notebook.
import requests
response = requests.get('http://jakevdp.github.com/downloads/notebooks/XKCD_plots.ipynb')
response.content[0:60]+'...'
We read the response into a slightly more convenient format which represent IPython notebook. There are not real advantages for now, except some convenient methods, but with time this structure should be able to guarantee that the notebook structure is valid.
from IPython.nbformat import current as nbformat
jake_notebook = nbformat.reads_json(response.content)
jake_notebook.worksheets[0].cells[0]
So we have here Jake's notebook in a convenient for, which is mainly a Super-Powered dict and list nested. You don't need to worry about the exact structure.
The nbconvert API exposes some basic exporter for common format and default options. We will start by using one of them. First we import it, instanciate an instance with all the defautl parameters and fed it the downloaded notebook.
import IPython.nbconvert
from IPython.config import Config
from IPython.nbconvert import HTMLExporter
## I use basic here to have less boilerplate and headers in the HTML.
## we'll see later how to pass config to exporters.
exportHtml = HTMLExporter(config=Config({'HTMLExporter':{'default_template':'basic'}}))
(body,resources) = exportHtml.from_notebook_node(jake_notebook)
The exporter returns a tuple containing the body of the converted notebook, here raw HTML, as well as a resources dict. The resource dict contains (among many things) the extracted PNG, JPG [...etc] from the notebook when applicable. The basic HTML exporter does keep them as embeded base64 into the notebook, but one can do ask the figures to be extracted. Cf advance use. So for now the resource dict should be mostly empty, except for 1 key containing some css, and 2 others whose content will be obvious.
Exporter are stateless, you won't be able to extract any usefull information (except their configuration) from them.
You can directly re-use the instance to convert another notebook. Each exporter expose for convenience a from_file
and from_filename
methods if you need.
print resources.keys()
print resources['metadata']
print resources['output_extension']
# print resources['inlining'] # too lng to be shown
# Part of the body, here the first Heading
start = body.index('<h1 id', )
print body[:400]+'...'
You can directly write the body into an HTML file if you wish, as you see it does not contains any body tag, or style declaration, but thoses are included in the default HtmlExporter if you do not pass it a config object as I did.
Extracting Figures¶
When exporting one might want to extract the base64 encoded figures to separate files, this is by default what does the RstExporter does, let see how to use it.
from IPython.nbconvert import RSTExporter
rst_export = RSTExporter()
(body,resources) = rst_export.from_notebook_node(jake_notebook)
print body[:970]+'...'
print '[.....]'
print body[800:1200]+'...'
Here we see that base64 images are not embeded, but we get what look like file name. Actually those are (Configurable) keys to get back the binary data from the resources dict we havent inspected earlier.
So when writing a Rst Plugin for any blogengine, Sphinx or anything else, you will be responsible for writing all those data to disk, in the right place. Of course to help you in this task all those naming are configurable in the right place.
let's try to see how to get one of these images
resources['outputs'].keys()
We have extracted 5 binary figures, here png
s, but they could have been svg, and then wouldn't appear in the binary sub dict.
keep in mind that a object having multiple repr will store all it's repr in the notebook.
Hence if you provide _repr_javascript_
,_repr_latex_
and _repr_png_
to an object, you will be able to determine at conversion time which representaition is the more appropriate. You could even decide to show all the representaition of an object, it's up to you. But this will require beeing a little more involve and write a few line of Jinja template. This will probably be the subject of another tutorial.
Back to our images,
from IPython.display import Image
Image(data=resources['outputs']['output_3_0.png'],format='png')
Yep, this is indeed the image we were expecting, and I was able to see it without ever writing or reading it from disk. I don't think I'll have to show to you what to do with those data, as if you are here you are most probably familiar with IO.
Extracting figures with HTML Exporter ?¶
Use case:
I write an awesome blog in HTML, and I want all but having base64 embeded images. Having one html file with all inside is nice to send to coworker, but I definitively want resources to be cached ! So I need an HTML exporter, and I want it to extract the figures !
Some theory¶
The process of converting a notebook to a another format with the nbconvert Exporters happend in a few steps:
- Get the notebook data and other required files. (you are responsible for that)
- Feed them to the exporter that will
- sequentially feed the data to a number of
Transformers
. Transformer only act on the structure of the notebook, and have access to it all. - feed the notebook through the jinja templating engine
- the use templates are configurable.
- templates make use of configurable macros called filters.
- sequentially feed the data to a number of
- The exporter return the converted notebook as well as other relevant resources as a tuple.
- Write what you need to disk, or elsewhere. (You are responsible for it)
Here we'll be interested in the Transformers
. Each Transformer
is applied successively and in order on the notebook before going through the conversion process.
We provide some transformer that do some modification on the notebook structure by default.
One of them, the ExtractOutputTransformer
is responsible for crawling notebook,
finding all the figures, and put them into the resources directory, as well as choosing the key
(filename_xx_y.extension
) that can replace the figure in the template.
The ExtractOutputTransformer
is special in the fact that it should be availlable on all Exporter
s, but is just inactive by default on some exporter.
# second transformer shoudl be Instance of ExtractFigureTransformer
exportHtml._transformers # 3rd one shouel be <ExtractOutputTransformer>
To enable it we will use IPython configuration/Traitlets system. If you are have already set some IPython configuration options, this will look pretty familiar to you. Configuration option are always of the form:
ClassName.attribute_name = value
A few ways exist to create such config, like reading a config file in your profile, but you can also do it programatically usign a dictionary. Let's create such a config object, and see the difference if we pass it to our HtmlExporter
from IPython.config import Config
c = Config({
'ExtractOutputTransformer':{'enabled':True}
})
exportHtml = HTMLExporter()
exportHtml_and_figs = HTMLExporter(config=c)
(_, resources) = exportHtml.from_notebook_node(jake_notebook)
(_, resources_with_fig) = exportHtml_and_figs.from_notebook_node(jake_notebook)
print 'resources without the "figures" key :'
print resources.keys()
print ''
print 'Here we have one more field '
print resources_with_fig.keys()
resources_with_fig['outputs'].keys()
So now you can loop through the dict and write all those figures to disk in the right place...
Custom transformer¶
Of course you can imagine many transformation that you would like to apply to a notebook. This is one of the reason we provide a way to register your own transformers that will be applied to the notebook after the default ones.
To do so you'll have to pass an ordered list of Transformer
s to the Exporter constructor.
But what is an transformer ? Transformer can be either decorated function for dead-simple Transformer
s that apply
independently to each cell, for more advance transformation that support configurability You have to inherit from
Transformer
and define a call
method as we'll see below.
All transforers have a magic attribute that allows it to be activated/disactivate from the config dict.
from IPython.nbconvert.transformers import Transformer
import IPython.config
print "Four relevant docstring"
print '============================='
print Transformer.__doc__
print '============================='
print Transformer.call.__doc__
print '============================='
print Transformer.transform_cell.__doc__
print '============================='
We don't provide convenient method to be aplied on each worksheet as the data structure for worksheet will be removed. (not the worksheet functionnality, which is still on it's way)
Example¶
I'll now demonstrate a specific example requested while nbconvert 2 was beeing developped. The ability to exclude cell from the conversion process based on their index.
I'll let you imagin how to inject cell, if what you just want is to happend static content at the beginning/end of a notebook, plese refer to templating section, it will be much easier and cleaner.
from IPython.utils.traitlets import Integer
class PelicanSubCell(Transformer):
"""A Pelican specific transformer to remove somme of the cells of a notebook"""
# I could also read the cells from nbc.metadata.pelican is someone wrote a JS extension
# But I'll stay with configurable value.
start = Integer(0, config=True, help="first cell of notebook to be converted")
end = Integer(-1, config=True, help="last cell of notebook to be converted")
def call(self, nb, resources):
#nbc = deepcopy(nb)
nbc = nb
# don't print in real transformer !!!
print "I'll keep only cells from ", self.start, "to ", self.end, "\n\n"
for worksheet in nbc.worksheets :
cells = worksheet.cells[:]
worksheet.cells = cells[self.start:self.end]
return nbc, resources
# I create this on the fly, but this could be loaded from a DB, and config object support merging...
c = Config({
'PelicanSubCell':{
'enabled':True,
'start':4,
'end':6,
}
})
I'm creating a pelican exporter that take PelicanSubCell
extra transformers and a config
object as parameter. This might seem redundant, but with configuration system you'll see that one can register an inactive transformer on all exporters and activate it at will form its config files and command line.
pelican = RSTExporter(transformers=[PelicanSubCell], config=c)
print pelican.from_notebook_node(jake_notebook)[0]
Update¶
All part on figure naming in template removed since many thinfs in API have changed
I think this is enough for now, As you have seen there are a few bugs here and there I need to correct before continuing. Next time I'll show you how to modify template :
{%- extends 'fullhtml.tpl' -%}
{% block input_group -%}
{% endblock input_group %}
... and you just removed all the codecell by keeping the output and markdown codecell, isn't that wonderfull ? You want to wrap each cell in your own div ?
{%- extends 'fullhtml.tpl' -%}
{% block codecell %}
<div class="myclass">
{{ super() }}
</div>
{%- endblock codecell %}
Try to look at what Jinja can do, thenlearn about Jinja Filters and imagine they can magically read your config file.
For example we provide a filter that highlight by presupposing code is Python. Or one that wraps text at a default length of 80 char... Want a rot13 filter on some codecell when doing exercises for student ? See you next time !
More...¶
One more example from one Pull-Request.
from IPython.nbconvert.filters.highlight import _pygment_highlight
from pygments.formatters import HtmlFormatter
from IPython.nbconvert.exporters import HTMLExporter
from IPython.config import Config
from IPython.nbformat import current as nbformat
def my_highlight(source, language='ipython'):
formatter = HtmlFormatter(cssclass='highlight-ipynb')
return _pygment_highlight(source, formatter, language)
c = Config({'CSSHtmlHeaderTransformer':
{'enabled':True, 'highlight_class':'highlight-ipynb'}})
exportHtml = HTMLExporter( config=c , filters={'highlight': my_highlight} )
(body,resources) = exportHtml.from_notebook_node(jake_notebook)
from jinja2 import DictLoader
dl = DictLoader({'html_full.tpl':
"""
{%- extends 'html_basic.tpl' -%}
{% block footer %}
FOOOOOOOOTEEEEER
{% endblock footer %}
"""})
exportHtml = HTMLExporter( config=None , filters={'highlight': my_highlight}, extra_loaders=[dl] )
(body,resources) = exportHtml.from_notebook_node(jake_notebook)
for l in body.split('\n')[-4:]:
print l