Articles

42: Supplemental Materials - Python Linear Algebra Packages

A first example of using pybind11

Create a new subdirectory, e.g. example1, and create the following five files in it:

First, write the C++ header and implementation files.

Next, write the C++ wrapper code using pybind11 in wrap.cpp. The arguments "i"_a=1, "j"_a=2 in the exported function definition tell pybind11 to generate keyword arguments named i (default value 1) and j (default value 2) for the add function.
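As a hedged illustration of what those keyword defaults provide on the Python side (assuming the wrapped function is exported in a module named funcs, as below):

    import funcs

    funcs.add()          # uses the defaults i=1, j=2  ->  3
    funcs.add(10)        # i=10, j=2                   -> 12
    funcs.add(i=3, j=4)  # arguments passed by keyword ->  7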

Finally, write the setup.py file to compile the extension module. This is mostly boilerplate.
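A minimal sketch of such a setup.py, assuming the C++ sources are named funcs.cpp and wrap.cpp and the module is called funcs (consistent with the funcs.so mentioned below); the pybind11 setup helpers are one common way to write this boilerplate:

    # setup.py -- build in place with:  python setup.py build_ext --inplace
    from setuptools import setup
    from pybind11.setup_helpers import Pybind11Extension, build_ext

    ext_modules = [
        Pybind11Extension(
            "funcs",                    # name of the resulting extension module
            ["funcs.cpp", "wrap.cpp"],  # C++ implementation + pybind11 wrapper (assumed names)
            cxx_std=11,
        ),
    ]

    setup(
        name="funcs",
        version="0.0.1",
        ext_modules=ext_modules,
        cmdclass={"build_ext": build_ext},
    )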

Now build the extension module in the subdirectory containing these files (for example with python setup.py build_ext --inplace).

And if you are successful, you should now see a new funcs.so extension module. We can write a test_funcs.py file to test the extension module:
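A possible test_funcs.py, assuming the module exports the add function with the defaults described above; run it directly or with pytest:

    # test_funcs.py
    import funcs

    def test_add():
        assert funcs.add() == 3        # defaults i=1, j=2
        assert funcs.add(4, 5) == 9
        assert funcs.add(j=10) == 11   # keyword argument generated by pybind11

    if __name__ == "__main__":
        test_add()
        print("all tests passed")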

And finally, running the test should not generate any error messages.


Background

The study of aging, and in particular human aging, has become a very active field in genomics [1, 2], largely due to the role of DNA methylation [3]. Methylation serves as an epigenetic marker, as it reflects the state of cells as they undergo developmental changes [4]. Methylation, however, also continues beyond the developmental stage as humans age, albeit at a significantly lower rate [5–8]. DNA methylation therefore serves as a central epigenetic mechanism that helps define and maintain the state of cells during the entire life cycle [9–11]. To measure genome-wide levels of DNA methylation, techniques such as bisulfite sequencing and DNA methylation arrays are used [12].

In his seminal paper [13], Steve Horvath defined the term epigenetic clock, which later proved to be a very robust estimator of human age (see e.g. [14]). The scheme is divided in two: children up to age twenty, and adults. A raw estimated age is first calculated as a weighted sum over 353 sites. For children, this raw age is then log transformed to reflect the real chronological age; for adults, it is taken as is. This approach, of using an untransformed epigenetic state as chronological age, induces linearity between the two measures. This linearity can be compared to the classical concept from molecular evolution known as the molecular clock (MC) [15, 16].
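Schematically, and hedging on the exact calibration constants used in [13], the clock first forms a weighted sum over the 353 clock sites and then applies a piecewise transform:

    \hat{s} = w_0 + \sum_{i=1}^{353} w_i \beta_i ,
    \qquad
    \widehat{\mathrm{age}} =
    \begin{cases}
      g^{-1}(\hat{s}) & \text{for ages up to twenty (log-based calibration)},\\
      \hat{s} & \text{for adults (taken as is)},
    \end{cases}

where \beta_i is the methylation level at clock site i, w_i are the fitted weights, and g denotes the logarithmic calibration function defined in [13].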

The rate constancy of the MC can be relaxed by a mechanism dubbed the universal pacemaker (UPM, or simply pacemaker, PM) of genome evolution [17–20]. Under the UPM, genes within a genome evolving along a lineage can vary their intrinsic mutation rate in concert with all other genes in the genome. Figure 1 illustrates the differences between the two models, UPM and MC. The UPM mechanism can be adapted from molecular evolution to model the process of methylation.

Molecular Clock vs Universal PaceMaker: Solid (colored) lines represent different methylation sites. Vertical (dashed) lines represent time points; hence, dots along a dashed line correspond to the (log) methylation rates of each site at that time point. Under the Molecular Clock (MC) model (left), methylation rates differ between sites but are constant in time. By contrast, under the Universal PaceMaker (UPM) model (right), rates may vary with time but the pairwise ratio between site rates remains constant (the difference between log rates is constant).

In a line of works [21–23] we have developed a model that borrows ideas from the UPM to describe methylation sites in the body. While the time linearity described above can be perceived as the MC, under the UPM approach the adapted model assumes that all sites change according to an adjusted time, which is a non-linear function of the chronological time. This paradigm, denoted the epigenetic pacemaker (EPM), is appealing for studying age-related changes in methylation, where methylation sites correspond to evolving genes.

The first work on the EPM [21] used a simple approach to find the optimal, maximum likelihood, values for the variables sought, which restricted the inputs analyzed to small sizes and limited the biological inference. In a recent work [22] we devised a conditional expectation maximization (CEM) algorithm [24], an extension of the widespread expectation maximization (EM) algorithm [25]. CEM is applied when optimizing the likelihood function over the entire parameter set simultaneously is hard: the parameters are partitioned into two or more subsets and optimization is done separately, in an alternating manner. In our specific setting, this partitioning separated the variables into site variables and time variables that are optimized in two alternating steps. Here, however, we combine the structure of the EPM model with insights from linear algebra and, with the help of symbolic algebra tools (e.g. SageMath [26]), trace the use of variables through the entire linear algebra stage. The latter allows us to bypass that heavy step completely, resulting in a prominent improvement, both practical and theoretical, in both running time and memory. This improvement is complemented by a linear-time, closed-form solution to the second step of the CEM, the time step. Unifying these two improved steps under a combined high-level algorithm such as the CEM yields a very fast algorithm that terminates within a few EM iterations.
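For intuition only, a minimal Python sketch of such an alternating scheme, assuming the usual EPM model form m_ij ≈ m0_i + r_i·s_j with per-sample epigenetic states s_j; the function and variable names are illustrative and this is not the published implementation:

    import numpy as np

    def epm_cem(M, ages, n_iter=50, tol=1e-6):
        """Alternating (CEM-style) fit of m_ij ~ m0_i + r_i * s_j.

        M is a (sites x samples) methylation matrix; chronological ages are
        used only to initialize the epigenetic states s."""
        s = np.asarray(ages, dtype=float).copy()
        prev_rss = np.inf
        for _ in range(n_iter):
            # site step: per-site linear regression of methylation on the states
            s_mean = s.mean()
            s_var = ((s - s_mean) ** 2).sum()
            r = ((M - M.mean(axis=1, keepdims=True)) @ (s - s_mean)) / s_var  # rates
            m0 = M.mean(axis=1) - r * s_mean                                  # intercepts

            # time step: closed-form update of each sample's epigenetic state
            s = (r @ (M - m0[:, None])) / (r @ r)

            rss = ((M - m0[:, None] - np.outer(r, s)) ** 2).sum()
            if prev_rss - rss < tol:
                break
            prev_rss = rss
        return r, m0, s

The two steps here mirror the site step and time step of the CEM described above; in this sketch each has a closed form, so every iteration costs time linear in the size of the methylation matrix.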

The improvements described above give rise to a substantial increase in the scale of inputs that can be analyzed by the method, and enable the application of the new approach to data sets that could not be analyzed before.


Matlab, Python, Julia: What to Choose in Economics?

We perform a comparison of Matlab, Python and Julia as programming languages to be used for implementing global nonlinear solution techniques. We consider two popular applications: a neoclassical growth model and a new Keynesian model. The goal of our analysis is twofold: first, it is aimed at helping researchers in economics choose the programming language that is best suited to their applications and, if needed, help them transition from one programming language to another. Second, our collection of routines can be viewed as a toolbox with a special emphasis on techniques for dealing with high-dimensional economic problems. We provide routines in the three languages for constructing random and quasi-random grids, low-cost monomial integration, various global solution methods, and routines for checking the accuracy of the solutions, as well as examples of parallelization. Our global solution methods are not only accurate but also fast. Solving a new Keynesian model with eight state variables takes only a few seconds, even in the presence of an active zero lower bound on nominal interest rates. This speed is important because it allows the model to be solved repeatedly, as would be required for estimation.



Array proliferation and interoperability

NumPy provides in-memory, multidimensional, homogeneously typed (that is, single-pointer and strided) arrays on CPUs. It runs on machines ranging from embedded devices to the world’s largest supercomputers, with performance approaching that of compiled languages. For most of its existence, NumPy addressed the vast majority of array computation use cases.

However, scientific datasets now routinely exceed the memory capacity of a single machine and may be stored on multiple machines or in the cloud. In addition, the recent need to accelerate deep-learning and artificial intelligence applications has led to the emergence of specialized accelerator hardware, including graphics processing units (GPUs), tensor processing units (TPUs) and field-programmable gate arrays (FPGAs). Owing to its in-memory data model, NumPy is currently unable to directly utilize such storage and specialized hardware. However, both distributed data and the parallel execution on GPUs, TPUs and FPGAs map well to the paradigm of array programming, leading to a gap between available modern hardware architectures and the tools necessary to leverage their computational power.

The community’s efforts to fill this gap led to a proliferation of new array implementations. For example, each deep-learning framework created its own arrays; the PyTorch [38], Tensorflow [39], Apache MXNet [40] and JAX arrays all have the capability to run on CPUs and GPUs in a distributed fashion, using lazy evaluation to allow for additional performance optimizations. SciPy and PyData/Sparse both provide sparse arrays, which typically contain few non-zero values and store only those in memory for efficiency. In addition, there are projects that build on NumPy arrays as data containers and extend its capabilities. Distributed arrays are made possible in that way by Dask, and labelled arrays (referring to dimensions of an array by name rather than by index for clarity; compare x[:, 1] versus x.loc[:, 'time']) by xarray [41].

Such libraries often mimic the NumPy API, because this lowers the barrier to entry for newcomers and provides the wider community with a stable array programming interface. This, in turn, prevents disruptive schisms such as the divergence between Numeric and Numarray. But exploring new ways of working with arrays is experimental by nature and, in fact, several promising libraries (such as Theano and Caffe) have already ceased development. And each time that a user decides to try a new technology, they must change import statements and ensure that the new library implements all the parts of the NumPy API they currently use.

Ideally, operating on specialized arrays using NumPy functions or semantics would simply work, so that users could write code once, and would then benefit from switching between NumPy arrays, GPU arrays, distributed arrays and so forth as appropriate. To support array operations between external array objects, NumPy therefore added the capability to act as a central coordination mechanism with a well specified API (Fig. 2).

To facilitate this interoperability, NumPy provides ‘protocols’ (or contracts of operation) that allow for specialized arrays to be passed to NumPy functions (Fig. 3). NumPy, in turn, dispatches operations to the originating library, as required. Over four hundred of the most popular NumPy functions are supported. The protocols are implemented by widely used libraries such as Dask, CuPy, xarray and PyData/Sparse. Thanks to these developments, users can now, for example, scale their computation from a single machine to distributed systems using Dask. The protocols also compose well, allowing users to redeploy NumPy code at scale on distributed, multi-GPU systems via, for instance, CuPy arrays embedded in Dask arrays. Using NumPy’s high-level API, users can leverage highly parallel code execution on multiple systems with millions of cores, all with minimal code changes [42].

In this example, NumPy’s ‘mean’ function is called on a Dask array. The call succeeds by dispatching to the appropriate library implementation (in this case, Dask) and results in a new Dask array. Compare this code to the example code in Fig. 1g.
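A minimal sketch of the kind of call the figure describes, assuming Dask is installed (the array shape and chunking here are arbitrary):

    import numpy as np
    import dask.array as da

    x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))  # lazy, chunked array
    m = np.mean(x)       # NumPy dispatches to Dask's implementation via the protocols
    print(type(m))       # a Dask array, not a NumPy scalar
    print(m.compute())   # evaluation happens only when explicitly requested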

These array protocols are now a key feature of NumPy, and are expected to only increase in importance. The NumPy developers—many of whom are authors of this Review—iteratively refine and add protocol designs to improve utility and simplify adoption.


Project organization and community

Governance

SciPy adopted an official governance document (https://docs.scipy.org/doc/scipy/reference/dev/governance/governance.html) on August 3, 2017. A steering council, currently composed of 18 members, oversees daily development of the project by contributing code and reviewing contributions from the community. Council members have commit rights to the project repository, but they are expected to merge changes only when there are no substantive community objections. The chair of the steering council, Ralf Gommers, is responsible for initiating biannual technical reviews of project direction and summarizing any private council activities to the broader community. The project’s benevolent dictator for life, Pauli Virtanen, has overruling authority on any matter, but is expected to act in good faith and only exercise this authority when the steering council cannot reach agreement.

SciPy’s official code of conduct was approved on October 24, 2017. In summary, there are five specific guidelines: be open to everyone participating in our community; be empathetic and patient in resolving conflicts; be collaborative, as we depend on each other to build the library; be inquisitive, as early identification of issues can prevent serious consequences; and be careful with wording. The code of conduct specifies how breaches can be reported to a code of conduct committee and outlines procedures for the committee’s response. Our diversity statement “welcomes and encourages participation by everyone.”

Maintainers and contributors

There are roughly 100 unique contributors for every 6-month release cycle. Anyone with the interest and skills can become a contributor; the SciPy contributor guide (https://scipy.github.io/devdocs/dev/contributor/contributor_toc.html) provides guidance on how to do that. In addition, the project currently has 15 active (volunteer) maintainers: people who review the contributions of others and do everything else needed to ensure that the software and the project move forward. Maintainers are critical to the health of the project [93]; their skills and efforts largely determine how fast the project progresses, and they enable input from the much larger group of contributors. Anyone can become a maintainer, too, as they are selected on a rolling basis from contributors with a substantial history of high-quality contributions.

Funding

The development cost of SciPy is estimated in excess of 10 million dollars by Open Hub (https://www.openhub.net/p/scipy/estimated_cost). Yet the project is largely unfunded, having been developed predominantly by graduate students, faculty and members of industry in their free time. Small amounts of funding have been applied with success: some meetings were sponsored by universities and industry, Google’s Summer of Code program supported infrastructure and algorithm work, and teaching grant funds were used early on to develop documentation. However, funding from national agencies, foundations and industry has not been commensurate with the enormous stack of important software that relies on SciPy. More diverse spending to support planning, development, management and infrastructure would help SciPy remain a healthy underpinning of international scientific and industrial endeavors.

Downstream projects

The scientific Python ecosystem includes many examples of domain-specific software libraries building on top of SciPy features and then returning to the base SciPy library to suggest and even implement improvements. For example, there are common contributors to the SciPy and Astropy core libraries [94], and what works well for one of the codebases, infrastructures or communities is often transferred in some form to the other. At the codebase level, the binned_statistic functionality is one such cross-project contribution: it was initially developed in an Astropy-affiliated package and later placed in SciPy. In this way, SciPy serves as a catalyst for cross-fertilization throughout the Python scientific computing community.
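As an illustration of that functionality, a small hedged example of scipy.stats.binned_statistic on synthetic data:

    import numpy as np
    from scipy.stats import binned_statistic

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=1000)
    y = np.sin(x) + 0.1 * rng.standard_normal(1000)

    # mean of y within 10 equal-width bins of x
    means, bin_edges, bin_numbers = binned_statistic(x, y, statistic="mean", bins=10)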




Modules

jQuery & JavaScript Basics

In this jQuery & JavaScript Basics online training series, you will learn the basics of programming JavaScript™ and using jQuery which we’ve broken down into 2 sections.

jQuery Mobile

With this jQuery Mobile training course, learn the options for local or remote file hosting, creating custom themes for your pages, how to navigate and transition between pages, and much more!

Deep Learning & Computer Vision: An Introduction

Deep Learning & Computer Vision: an Introduction is the perfect course for students who want exposure to Machine Learning. This course will cover topics such as: Artificial Neural Networks, how to install Python, and Handwritten Digit Recognition.

Recommendation Systems in Python

Recommendation Engines perform a variety of tasks, but the most important one is to find products that are most relevant to the user. Follow along with this intensive Recommendation Systems in Python training course to get a firm grasp on this essential Machine Learning component.

Machine Learning – Decision Trees & Random Forests

This Machine Learning: Decision Trees & Random Forests online course will teach you cool machine learning techniques to predict survival probabilities aboard the Titanic – a Kaggle problem!

Twitter Sentiment Analysis in Python

Sentiment Analysis, or Opinion Mining, is a field of Natural Language Processing that deals with extracting subjective information, like positive/negative, like/dislike, and emotional reactions. In this “Twitter Sentiment Analysis in Python” online course, you’ll learn real examples of why Sentiment Analysis is important and how to approach specific problems using Sentiment Analysis.

Factor Analysis with Excel, R and Python

Our ‘Factor Analysis’ e-Learning course will help you to understand Factor Analysis and its link to linear regression. See how Principal Components Analysis is a cookie-cutter technique to solve factor extraction and how it relates to Machine Learning. Supplemental Materials included!

Linear & Logistic Regression with Excel, R and Python

Our ‘Linear & Logistic Regression’ e-Learning course will teach you how to build robust linear models and do logistic regressions in Excel, R, and Python that will be automatically applicable in real world situations.

Quant Trading Using Machine Learning

This ‘Quant Trading Using Machine Learning’ online training course takes a completely practical approach to applying Machine Learning techniques to Quant Trading. The focus is on practically applying ML techniques to develop sophisticated Quant Trading models. From setting up your own historical price database in MySQL, to writing hundreds of lines of Python code, the focus is on doing from the get-go.

Apache Storm: Learn by Example

In this ‘Apache Storm: Learn by Example’ online course, you will learn how to use Storm to build applications which need you to be highly responsive to the latest data, and react within seconds and minutes, such as finding the latest trending topics on Twitter, or monitoring spikes in payment gateway failures.

SQL Server Development

Learn SQL Server 2005 basics, best practices, dozens of targeted examples, and sample code. If you design or develop SQL Server 2005 databases, this video series is what you need to succeed!

C++ Primer

With this C++ Primer training course, presenter Wade Ashby gets you started with setting up an environment and then guides you through defining Variables and Constants. Next, Wade delves into topics such as Loops, Arrays, Strings, and more!


VII. PYTHON APPLICATIONS

There are many applications that are based on Python or offer programming capabilities in Python. As one example, Sage (http://www.sagemath.org) is a freestanding mathematical software package with features covering many aspects of mathematics, including algebra, combinatorics, numerical mathematics, number theory, and calculus. It uses many open-source computation packages with the goal of “creating a viable free open source alternative to Magma, Maple, Mathematica, and Matlab.” It is not an importable Python package, but it does offer a convenient Python programming interface. Sage can also be used for cloud-based computing, where a task is distributed to multiple computers at remote sites. Other examples include Spyder (https://pythonhosted.org/spyder/), a code development and computing environment, and Enthought's PyXLL (https://www.pyxll.com/), which allows Python to be embedded into Excel spreadsheets.

The IPython program is one of the nicest tools to emerge from the Python community (Pérez and Granger, 2007). The IPython output (notebook file) architecture is built on top of widely used protocols such as JSON, ZeroMQ, and WebSockets, with CSS for HTML styling and presentation, and with open standards (HTML, Markdown, PNG, and JPEG) to store the contents. The IPython program is now starting to be used as a programming environment for other languages; since much of its capability is general and because of the use of open standards, other programs can read the notebook files.

The IPython notebook serves many purposes including: code prototyping, collaboration, reviewing documentation, and debugging. It incorporates tools for profiling code, so that if a program is too slow, the (usually short) sections of the code that are taking the most execution time can be quickly identified and reworked. It integrates well with matplotlib and NumPy and can be used as one might employ Matlab, to interactively analyze data, with integrated graphics and computation. Finally, IPython provides a mechanism for remote execution of Python commands. This can be used to distribute a large computation to a supercomputer. Since Python commands and intrinsic data objects are portable, the remote computer can be a completely different architecture from the controller. Perhaps the biggest problem with IPython is how to master all the features.

A. IPython as a shell

Many developers use IPython as an interactive shell to run Python commands and rarely invoke Python directly via a command window. It provides many advantages over running Python directly, chiefly code completion: if part of a command or function is typed and the tab key is pressed, then all of the possible completions are shown; if only one completion exists, the remaining characters are added. It will also show the docstring documentation of functions and classes if a command is prefixed or followed by a question mark (?).

In IPython, a prompt of In [#]: is used before user input and Out [#]: is shown for values returned from a command. The # symbol is replaced with a number. In and Out maintain a history of the commands that have been run and the values they returned (In is a list of the input strings; Out is a dict keyed by prompt number).
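A short illustrative session (prompt numbers and output formatting will vary):

    In [1]: import numpy as np

    In [2]: np.sqrt(2)
    Out[2]: 1.4142135623730951

    In [3]: Out[2] ** 2
    Out[3]: 2.0000000000000004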

IPython provides a feature called “tab completion” that suggests options to finish typing a command. As an example, if one is not sure how to call the NumPy square-root function and types np.sq and then presses the tab key, IPython responds by showing three possible completions, as is shown below:
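With a recent NumPy version, the completion list looks something like:

    In [4]: np.sq<TAB>
    np.sqrt     np.square   np.squeeze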

Typing an additional r and then pressing tab again causes the t to be added, since that is the only valid completion. Placing a question mark before or after the command causes the docstring for the routine to be displayed, such as:
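For example, continuing the session above:

    In [5]: np.sqrt?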

This documentation then includes:
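An abridged version of what is displayed for np.sqrt (the exact text depends on the NumPy and IPython versions):

    Call signature: np.sqrt(*args, **kwargs)
    Type:           ufunc
    String form:    <ufunc 'sqrt'>
    Docstring:
        sqrt(x, /, out=None, *, where=True, ...)

        Return the non-negative square-root of an array, element-wise.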

When IPython is used as an interactive environment for data analysis, the history of commands stored in In can be used as a source of Python commands to be placed in a program to automate that processing.

IPython can also be used to execute Python commands in files. Using the %run command, the contents of a file can be executed in a “fresh” interpreter or in the current Python context (%run -i). The latter case causes variable definitions made in the current session prior to the %run command to be passed into the file. Either way, the variables defined during execution of the commands in the file are loaded into the current session. The %run command can also be used for debugging, timing execution, and profiling code.
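For example (analysis.py is a hypothetical script name):

    In [6]: %run analysis.py      # execute the file in a fresh namespace
    In [7]: %run -i analysis.py   # execute it using variables already defined in this session
    In [8]: %run -t analysis.py   # report how long the execution took
    In [9]: %run -p analysis.py   # run the file under the profiler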

B. IPython with notebooks

An IPython notebook is an analog to running commands in terminal-based IPython, but using a web browser window (Shen, 2014). Python commands and their results can be saved for future communication and collaboration. To start IPython in this mode, type ipython notebook in the command line on your computer. An IPython notebook is comparable to a spreadsheet with only a single column, but where each cell can contain formatted text or Python code, optionally accompanied with output from the previous commands. As shown in Figure 3, output may contain graphics and plots embedded within the notebook.

Figure 3. (Color online) An example of a computation performed in an IPython notebook showing a simple computation and the graphed result. The IPython notebook file can be shared by e-mail or even within a network, with a secured server.
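The figure itself is not reproduced here; a cell of the kind it shows might look like the following (the damped sine curve is an arbitrary choice):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 4 * np.pi, 200)
    y = np.exp(-x / 5) * np.sin(x)   # a simple computation...
    plt.plot(x, y)                   # ...and its graphed result, shown inline in the notebook
    plt.xlabel("x")
    plt.ylabel("damped sine")
    plt.show()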

Hence, an IPython notebook is an excellent way for:

• researchers to collaborate on data analysis,

• instructors to provide worked-through examples of computations and perhaps even homework or exam problems,

• speakers to showcase Python code within their slides in their talks or workshops.

Normally, the IPython notebook web service can only be accessed from the computer where the process is started. With appropriate security protections in place, an IPython notebook server could allow individuals to attack a programming or data analysis problem as a team with all members of the team running Python from the same server using IPython via a web browser. An IPython notebook likewise can be used within a classroom setting to allow a group of students to work collaboratively with Python without having to install the interpreter on their individual computers. There are also free and commercial services that offer access to IPython notebooks over the web, such as Wakari (https://www.wakari.io/wakari). Notebooks can also be freely viewed via the nbviewer service (http://nbviewer.ipython.org/), by entering a link to your gitlab or github repository or a link to your own website that hosts the notebook. The IPython sponsored Jupyter CoLaboratory (https://colaboratory.jupyter.org/welcome/) project allows real-time editing of notebooks, and currently uses Google Drive to restrict access to users with shared permissions.

As an example of how IPython can be used to share a worked-through problem, the computations in the “Performing efficient computations with NumPy” section, above, are provided at this URL (https://anl.box.com/s/hb3ridp66r247sq1k5qh) and the Supplementary Materials, which also includes IPython notebook files demonstrating several other code examples along with their output.


The answer to this question is rather broad, since there are several possible algorithms for training SVMs. Also, packages like LibSVM (available for both Python and R) are open source, so you are free to check the code inside.

Hereinafter I will consider the Sequential Minimal Optimization (SMO) algorithm by J. Platt, which is implemented in LibSVM. Manually implementing an algorithm that solves the SVM optimization problem is rather tedious but, if this is your first encounter with SVMs, I'd suggest the following (albeit simplified) version of the SMO algorithm.

This lecture is from Prof. Andrew Ng (Stanford), and he shows a simplified version of the SMO algorithm. I don't know what your theoretical background in SVMs is, but let's just say that the major difference is that the pair of Lagrange multipliers (alpha_i and alpha_j) is selected randomly, whereas the original SMO algorithm uses a much more involved heuristic.
In other words, this simplified algorithm is not guaranteed to converge to the global optimum (which a proper training algorithm always reaches for the dataset at hand, since the SVM problem is convex), but it gives a nice introduction to the optimization problem behind SVMs.

This paper, however, does not show any code in R or Python, but the pseudo-code is rather straightforward and easy to implement (roughly 100 lines of code in either Matlab or Python). Also keep in mind that this training algorithm returns alpha (the vector of Lagrange multipliers) and b (the intercept). You might want some additional quantities, such as the actual Support Vectors, but starting from the vector alpha these are rather easy to calculate.
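For concreteness, here is a hedged Python sketch of that simplified SMO loop with a linear kernel and random choice of the second multiplier; the names are mine, and it illustrates the structure of the algorithm rather than reproducing LibSVM's implementation:

    import numpy as np

    def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5, seed=0):
        """Simplified SMO: X is an (m, n) array of patterns, y an (m,) array of -1/+1 labels.
        Returns the Lagrange multipliers alpha and the intercept b."""
        rng = np.random.default_rng(seed)
        m = X.shape[0]
        K = X @ X.T                          # linear kernel matrix
        alpha = np.zeros(m)
        b = 0.0
        passes = 0
        while passes < max_passes:
            changed = 0
            for i in range(m):
                E_i = (alpha * y) @ K[:, i] + b - y[i]          # prediction error on pattern i
                if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                    j = rng.integers(m - 1)
                    if j >= i:               # pick j != i uniformly at random
                        j += 1
                    E_j = (alpha * y) @ K[:, j] + b - y[j]
                    a_i_old, a_j_old = alpha[i], alpha[j]
                    if y[i] != y[j]:         # box constraints for alpha[j]
                        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                    else:
                        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                    if L == H:
                        continue
                    eta = 2.0 * K[i, j] - K[i, i] - K[j, j]
                    if eta >= 0:
                        continue
                    alpha[j] = np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H)
                    if abs(alpha[j] - a_j_old) < 1e-5:
                        continue
                    alpha[i] += y[i] * y[j] * (a_j_old - alpha[j])
                    di = y[i] * (alpha[i] - a_i_old)
                    dj = y[j] * (alpha[j] - a_j_old)
                    b1 = b - E_i - di * K[i, i] - dj * K[i, j]
                    b2 = b - E_j - di * K[i, j] - dj * K[j, j]
                    if 0 < alpha[i] < C:
                        b = b1
                    elif 0 < alpha[j] < C:
                        b = b2
                    else:
                        b = (b1 + b2) / 2.0
                    changed += 1
            passes = passes + 1 if changed == 0 else 0
        return alpha, b

The full SMO algorithm replaces the random choice of j with the heuristic mentioned above.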

First, let's suppose you have a LearningSet with as many rows as there are patterns and as many columns as there are features, and a LearningLabels vector with as many elements as there are patterns, which (as its name suggests) contains the proper labels for the patterns in LearningSet.
Also recall that alpha has as many elements as there are patterns.

In order to find the Support Vector indices, you can check whether element i of alpha is strictly greater than 0: if alpha[i] > 0, then the i-th pattern from LearningSet is a Support Vector. Similarly, the i-th element of LearningLabels is the related label.

Finally, you might want to evaluate the weight vector w, the vector of free parameters. Given that alpha is known, you can apply the following formula:
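The formula itself appears to have been lost from this excerpt; based on the operator description below, it is presumably the standard recovery of the primal weight vector, which in the answer's Matlab-style notation would read w = ((alpha .* LearningLabels)' * LearningSet)', i.e.

    w = \sum_{i=1}^{N} \alpha_i \, y_i \, x_i

where y_i is the i-th label and x_i the i-th row (pattern) of LearningSet.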

where alpha and LearningLabels are column vectors and LearningSet is the matrix described above. In that formula, .* is the element-wise product and ' is the transpose operator.
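A brief NumPy sketch of the same bookkeeping, assuming arrays named as in the answer:

    import numpy as np

    def extract_svm_parameters(alpha, LearningSet, LearningLabels):
        """Recover support-vector indices and the primal weight vector w
        from the alpha returned by an SMO-style trainer."""
        sv_idx = np.where(alpha > 0)[0]              # support vector indices
        support_vectors = LearningSet[sv_idx]        # the corresponding patterns
        support_labels = LearningLabels[sv_idx]      # ...and their labels
        w = (alpha * LearningLabels) @ LearningSet   # w = sum_i alpha_i * y_i * x_i
        return sv_idx, support_vectors, support_labels, w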

