#220 What, why, and where of friendly errors in Python

Update: 2021-02-11

Description

Sponsored by Datadog: pythonbytes.fm/datadog



Special guest: Hannah Stepanek



Watch on YouTube




Michael #1: We Downloaded 10,000,000 Jupyter Notebooks From Github – This Is What We Learned




  • by Alena Guzharina from JetBrains

  • Used the hundreds of thousands of publicly accessible repos on GitHub to learn more about the current state of data science. I think it’s inspired by work showcased here on Talk Python.

  • Two years ago there were 1,230,000 Jupyter Notebooks published on GitHub. By October 2020 that number had grown roughly 8 times, and we were able to download 9,720,000 notebooks.

  • Despite the rapid growth in popularity of R and Julia in recent years, Python still remains the most commonly used language for writing code in Jupyter Notebooks by an enormous margin.

  • Python 2 went from 53% → 11% in the last two years.

  • Interesting graphs about package usage

  • Not all notebooks are storytelling with code: 50% of notebooks contain fewer than 4 Markdown cells and more than 66 code cells.

  • Although there are some outliers, like notebooks with more than 25,000 lines of code, 95% of the notebooks contain fewer than 465 lines of code.



Brian #2: pytest-pythonpath




  • A plugin for adding directories to the PYTHONPATH from the pytest.ini file before tests run

  • Mentioned briefly in episode 62 as a temporary stopgap until you set up a proper package install for your code. (cringing at my arrogance).

  • Lots of projects are NOT packages. For example, applications.

  • I’ve been working with more and more people to get started on testing and the first thing that often comes up is “My tests can’t see my code. Please fix.”

  • Example

    • proj/src/stuff_you_want_to_test.py

    • proj/tests/test_code.py

    • You can’t import stuff_you_want_to_test.py from the proj/tests directory by default.


  • The more I look at the problem, the more I appreciate the simplicity of pytest-pythonpath

  • pytest-pythonpath does one thing I really care about:

    • Add this to a pytest.ini file at the proj level:




    [pytest]
    python_paths = src



  • That’s it. That’s all you have to do to fix the above problem.

  • Paths are relative to the directory that contains pytest.ini, which should be a parent or grandparent of the tests directory.

  • I really can’t think of a simpler way for people to get around this problem.
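
  • As a rough illustration of what that setting amounts to, the sketch below prepends a directory to sys.path by hand, the way you might in a conftest.py. This is not the plugin's actual implementation; add_src_to_path and the proj/src layout are assumptions made for the example.

```python
import sys
from pathlib import Path


def add_src_to_path(project_root, relative_dir="src"):
    """Prepend project_root/relative_dir to sys.path if it is not already there.

    Hypothetical helper for illustration only; pytest-pythonpath handles this
    kind of path setup for you based on python_paths in pytest.ini.
    """
    src_dir = str(Path(project_root, relative_dir).resolve())
    if src_dir not in sys.path:
        sys.path.insert(0, src_dir)
    return src_dir
```

  • Called from proj/conftest.py as add_src_to_path(Path(__file__).parent), something like this would let proj/tests/test_code.py run "import stuff_you_want_to_test". The plugin gets the same effect with two lines of configuration.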



Hannah #3: Thinking in Pandas




  • Pandas dependency hierarchy (simplified):

    • Pandas -> NumPy -> BLAS (Basic Linear Algebra Subprograms)


  • Languages:

    • Python -> C -> Assembly



    df["C"] = df["A"] + df["B"]

    A = [1, 4, 2, 0]
    B = [3, 2, 5, 1]
    C = [1 + 3, 4 + 2, 2 + 5, 0 + 1]



  • Pandas tries to get the best performance by running operations in parallel.

  • You might think we could speed this problem up by doing something like this:



    Thread 1: 1 + 3
    Thread 2: 4 + 2
    Thread 3: 2 + 5
    Thread 4: 0 + 1



  • However, the GIL (Global Interpreter Lock) prevents us from achieving the performance improvement we are hoping for.

  • Below is an example of a common threading problem and how a lock solves that problem.

    Thread 1          total           Thread 2
    1 + 3 + 4 + 2     0               0 + 5
    10                0               + 6 + 2
    total += 10       0               13
    total = 10        0               total += 13
                      10              total = 13
                      13

    Thread 1          total           Thread 2
    1 + 3 + 4 + 2     0  unlocked     0 + 5
    10                0  unlocked     + 6 + 2
    total += 10       0  locked       13
    total = 10        0  locked
                      10 unlocked
                      10 locked       total += 13
                      10 locked       total = 23
                      23 unlocked
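
  • A minimal sketch of the locked version in Python, using the same numbers as the table. threading.Lock comes from the standard library; the function and variable names are made up for the example.

```python
import threading

shared = {"total": 0}           # state visible to both threads
total_lock = threading.Lock()


def add_partial_sum(values):
    """Sum values locally, then add the result to the shared total under a lock."""
    partial = sum(values)       # touches no shared state, so no lock is needed
    with total_lock:            # only one thread at a time may update the total
        shared["total"] += partial


threads = [
    threading.Thread(target=add_partial_sum, args=([1, 3, 4, 2],)),  # Thread 1: 10
    threading.Thread(target=add_partial_sum, args=([0, 5, 6, 2],)),  # Thread 2: 13
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["total"])  # 23
```

  • The lock is held only around the read-modify-write of the shared total, which is the step where the unlocked table loses Thread 1's update.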



  • As it turns out, because Python manages memory for you, every object in Python would be subject to these kinds of threading issues whenever its reference count is updated:



    a = 1     # reference count = 1
    b = a     # reference count = 2
    del b     # reference count = 1
    del a     # reference count = 0
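
  • In CPython you can watch these counts move with sys.getrefcount. The sketch below compares counts only relative to each other, because the absolute numbers include temporary references, and small integers such as 1 are shared by the interpreter (so a fresh object is used instead). The Payload class is a placeholder for the example.

```python
import sys


class Payload:
    """Placeholder object whose reference count we observe."""


a = Payload()
baseline = sys.getrefcount(a)     # includes the temporary reference held by the call

b = a                             # a second name for the same object
after_alias = sys.getrefcount(a)  # one more than baseline

del b                             # drop the second name
after_del = sys.getrefcount(a)    # back to baseline

print(after_alias - baseline, after_del - baseline)  # 1 0
```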



  • So the GIL was invented to avoid this headache: it lets only one thread run Python code at a time.

  • Certain parts of the Pandas dependency hierarchy are not subject to the GIL (simplified):

    • Pandas -> NumPy -> BLAS (Basic Linear Algebra Subprograms)

    • GIL -> no GIL -> hardware optimizations


  • So we can get around the GIL in C land but what kind of optimizations does BLAS provide us with?

    • Parallel operations inside the CPU via Vector registers


  • A vector register is like a regular register but instead of holding one value it can hold multiple values.



    | 1 | 4 | 2 | 0 |
      +   +   +   +
    | 3 | 2 | 5 | 1 |
      =   =   =   =
    | 4 | 6 | 7 | 1 |



  • Vector registers are only so large, though, so the DataFrame is broken up into chunks and the vector operations are performed on each chunk.
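
  • The chunking idea can be mimicked in plain Python. This is only a sketch of the data flow: the lanes parameter is a made-up stand-in for how many values fit in one vector register, and a pure-Python loop gains none of the hardware speedup.

```python
def chunked_add(a, b, lanes=4):
    """Add two equal-length sequences chunk by chunk.

    lanes is a hypothetical register width. Real hardware adds a whole chunk
    in one instruction; this loop only imitates the chunking, not the speed.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    result = []
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # One simulated "vector instruction": each lane is added independently.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return result


print(chunked_add([1, 4, 2, 0], [3, 2, 5, 1]))              # [4, 6, 7, 1]
print(chunked_add([1, 4, 2, 0, 9, 9], [3, 2, 5, 1, 1, 1]))  # [4, 6, 7, 1, 10, 10]
```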



Michael #4: Quickle




  • Fast. Benchmarks show it’s among the fastest serialization methods for Python.

  • Safe. Unlike pickle, deserializing a user provided message doesn’t allow for arbitrary code execution.

  • Flexible. Unlike msgpack or json, Quickle natively supports a wide range of Python builtin types.

  • Versioning. Quickle supports “schema evolution”. Messages can be sent between clients with different schemas without error.

  • Example



    >>> import quickle
    >>> data = quickle.dumps({"hello": "world"})
    >>> quickle.loads(data)
    {'hello': 'world'}
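
  • To see what the "Safe" point guards against, here is a harmless standard-library demonstration of pickle calling a function chosen by whoever wrote the message. It uses only pickle, not Quickle; the Sneaky class is made up, and eval("1 + 1") stands in for something malicious.

```python
import pickle


class Sneaky:
    """Hypothetical object whose pickle tells the loader to call eval."""

    def __reduce__(self):
        # pickle stores "call eval('1 + 1')" instead of the object's state.
        return (eval, ("1 + 1",))


message = pickle.dumps(Sneaky())
loaded = pickle.loads(message)  # eval runs while the message is being loaded

print(loaded)  # 2
```

  • The loader never gets a Sneaky object back; it gets whatever the embedded call returned, which is why untrusted pickle data should never be loaded.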


Brian #5: what(), why(), where(), explain(), more() from friendly-traceback console




  • Do this:



    $ pip install friendly-traceback
    $ python -i
    >>> import friendly_traceback
    >>> friendly_traceback.start_console()
    >>>



  • Now, after an exception happens, you can ask questions about it.



    >>> pass = 1

    Traceback (most recent call last):
      File "[HTML_REMOVED]", line 1
        pass = 1
             ^
    SyntaxError: invalid syntax
    >>> what()
    SyntaxError: invalid syntax

    A `SyntaxError` occurs when Python cannot understand your code.

    >>> why()
    You were trying to assign a value to the Python keyword `pass`.
    This is not allowed.

    >>> where()
    Python could not understand the code in the file
    '[HTML_REMOVED]'
    beyond the location indicated by --> and ^.

    -->1: pass = 1
               ^



  • Cool for teaching or learning.
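
  • The why() answer above hinges on pass being a reserved keyword. A quick check with the standard library, independent of friendly-traceback; the helper name is made up for the example.

```python
import keyword

print(keyword.iskeyword("pass"))   # True
print(keyword.iskeyword("total"))  # False


def compiles_cleanly(source):
    """Return True if source compiles, False if Python raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


print(compiles_cleanly("pass = 1"))   # False
print(compiles_cleanly("total = 1"))  # True
```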



Hannah #6: Bandit




  • Bandit is a static analysis security tool.

  • It’s like a linter but for security issues.



    pip install bandit
    bandit -r .



  • I prefer to run it in a git pre-commit hook:



    # .pre-commit-config.yaml
    repos:
      - repo: https://github.com/PyCQA/bandit
        rev: '1.7.6'
        hooks:
          - id: bandit




  • To silence a finding on a specific line, add a # nosec comment:



    assert len(foo) == 1  # nosec
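
  • For context, the assert above is the kind of line Bandit reports (its B101 assert_used check), because assert statements are removed when Python runs with -O. A sketch of an alternative that needs no # nosec; the function name is made up for illustration.

```python
def require_single_item(foo):
    """Return the only element of foo, or raise ValueError.

    Unlike an assert, this check still runs under python -O.
    """
    if len(foo) != 1:
        raise ValueError(f"expected exactly 1 item, got {len(foo)}")
    return foo[0]


print(require_single_item(["only"]))  # only
```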


Extras:



Brian:




  • Meetups this week 2/3 done.

    • NOAA Tuesday, Aberdeen this morning - “pytest Fixtures”

    • PDX West tomorrow - Michael presenting “Python Memory Deep Dive”


  • Updated my training page, testandcode.com/training

    • Feedback welcome.

    • I really like working directly with teams and now that trainings can be virtual, a couple half days is super easy to do.




Michael:





Joke:



Sent in via Michel Rogers-Vallée, Dan Bader, and Allan Mcelroy. :)



PEP 8 Song



Watch on YouTube





Michael Kennedy (@mkennedy)