Python Pipelines and Data Sharing
By Chris Musselle and Kate Ross-Smith
For a conference about the R language, EARL London 2015 saw a surprising number of discussions about Python. I like to think that at least some of this was down to the 3-hour workshop we ran the day before the conference, outlining various strategies for integrating Python and R.
This is the first in a series of three blog posts that:
This post kicks everything off by:
From a quick internet search for articles about “R Python”, of the top 10 results, only 2 discuss the merits of using both R and Python rather than pitting them against each other. This is understandable; from their inception, both have had very distinctive strengths and weaknesses. Historically, though, the split has been one of educational background: statisticians have preferred the approach that R takes, whereas programmers have made Python their language of choice. However, with the growing breed of data scientists, this distinction blurs:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. —
With the wealth of distinct library resources provided by each language, there is a growing need for data scientists to be able to leverage their relative strengths. For example:
Python tends to outperform R in such areas as:
Whereas R outperforms Python in such areas as:
Further, as data science teams now have a relatively wide range of skills, the language of choice for any application may come down to prior knowledge and experience. For some applications – especially in prototyping and development – it is faster for people to use the tool that they already know.
In this series of posts we are going to consider the simplest strategy for integrating the two languages, and step through it with some examples. Using a flat file as an air gap between the two languages requires the following steps.
Pros
Cons
Running scripts from the command line in a Windows or Linux-like terminal environment works in much the same way for R and Python. The command to be run breaks down into the following parts:
where:
So for example, an R script is executed by opening up a terminal environment and running the following:
Rscript path/to/myscript.R arg1 arg2 arg3
A Few Gotchas
In the above example, arg1, arg2 and arg3 are the arguments passed to the R script being executed. They are accessible using the commandArgs function.
## myscript.R
# Fetch command line arguments
myArgs <- commandArgs(trailingOnly = TRUE)
# myArgs is a character vector of all arguments
print(myArgs)
print(class(myArgs))
By setting trailingOnly = TRUE, the vector myArgs only contains arguments that you added on the command line. If it is left as FALSE (the default), the vector also includes other arguments, such as the path to the script that was just executed.
For a Python script executed by running the following on the command line
python path/to/myscript.py arg1 arg2 arg3
the arguments arg1, arg2 and arg3 can be accessed from within the Python script by first importing the sys module. This module holds parameters and functions that are system specific, but here we are only interested in the argv attribute. This argv attribute is a list of all the arguments passed to the script currently being executed. The first element in this list is always the name of the script being executed; whether it is a full file path depends on the operating system and on how the script was invoked.
If you only wish to keep the arguments passed into the script, you can use list slicing to select all but the first element.
import sys

# Using a slice, select all but the first element
my_args = sys.argv[1:]
As with the above example for R, recall that all arguments are passed in as strings, and so will need converting to the expected types as necessary.
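Because every command line argument arrives as a string, a script usually converts them before doing any work. The following is a minimal sketch, assuming a hypothetical script that expects an input path, an integer row count and a float threshold. These argument names are illustrative only and do not come from the workshop material.

```python
def parse_args(argv):
    """Convert raw command line strings into typed values.

    Assumes three hypothetical positional arguments:
    an input path (str), a row count (int) and a threshold (float).
    """
    if len(argv) != 3:
        raise ValueError("expected 3 arguments: path n_rows threshold")
    path, n_rows, threshold = argv
    # int() and float() raise ValueError if the text is not numeric
    return path, int(n_rows), float(threshold)


# Stand-in for sys.argv[1:] when run as:
#   python path/to/myscript.py data.csv 100 0.5
example = ["data.csv", "100", "0.5"]
print(parse_args(example))  # ('data.csv', 100, 0.5)
```

In a real script you would pass sys.argv[1:] to parse_args instead of the example list. For anything beyond a handful of positional arguments, the standard library's argparse module may be a better fit.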
You have a few options when sharing data between R and Python via an intermediate file. In general for flat files, CSVs are a good format for tabular data, while JSON or YAML are best if you are dealing with more unstructured data (or metadata), which could contain a variable number of fields or more nested data structures.
All of these are very common, and parsers already exist in both languages. In R the following packages are recommended for each format:
And in Python:
The csv and json modules are part of the Python standard library, distributed with Python itself, whereas PyYAML will need installing separately. All R packages will also need installing in the usual way.
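To illustrate the split between tabular and nested data on the Python side, here is a minimal sketch using only the standard library csv and json modules. The records, the metadata and the file names are made up for the example, and the files are written to a temporary directory.

```python
import csv
import json
import os
import tempfile

# Hypothetical tabular data: every record has the same fields
rows = [
    {"id": 1, "score": 0.5},
    {"id": 2, "score": 0.75},
]
# Hypothetical metadata: nested, with a variable number of fields
meta = {"source": "example", "columns": ["id", "score"], "options": {"header": True}}

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "data.csv")
json_path = os.path.join(out_dir, "meta.json")

# newline="" is what the csv module documentation recommends for file objects
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "score"])
    writer.writeheader()
    writer.writerows(rows)

with open(json_path, "w") as f:
    json.dump(meta, f, indent=2)

# Reading back: CSV values come back as strings, JSON keeps its types
with open(csv_path, newline="") as f:
    rows_back = list(csv.DictReader(f))
with open(json_path) as f:
    meta_back = json.load(f)
```

Note that the CSV round trip loses type information, so the reading side (whether R or Python) has to convert columns back to numbers itself, whereas JSON preserves numbers, booleans, lists and nesting.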
So passing data between R and Python (and vice-versa) can be done in a single pipeline by:
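Seen from the Python side, one step of such a pipeline might be sketched as below. This is an illustration under assumptions rather than the workshop's own code: build_command and run_script are hypothetical helper names, and the sketch assumes the interpreter (for example Rscript) is available on the PATH.

```python
import shutil
import subprocess


def build_command(interpreter, script_path, *args):
    """Assemble <command> <path_to_script> <arguments> as a list of strings."""
    return [interpreter, script_path] + [str(a) for a in args]


def run_script(interpreter, script_path, *args):
    """Run a script and return what it printed; raises if it exits with an error."""
    if shutil.which(interpreter) is None:
        raise FileNotFoundError(f"{interpreter} not found on PATH")
    result = subprocess.run(
        build_command(interpreter, script_path, *args),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as text rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Hypothetical usage: Python has already written input.csv, the R script
# reads it and writes output.csv, which Python then loads.
# run_script("Rscript", "path/to/myscript.R", "input.csv", "output.csv")
```

Passing the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.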
However, in some instances, having to use a flat file as an intermediate data store can be both cumbersome and detrimental to performance. In the next post, we will look at how R and Python can directly call each other and return the output in memory.