Python 2 vs. 3: slow os.path.getmtime() with a huge file list - why?


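I am currently successfully using a Python 2.7 script that recursively walks a huge directory tree, collects the paths of all files, and gets the mtime of each file as well as the mtime of the corresponding file with the same path and name but a pdf extension, so the two can be compared. I use scandir.walk() in the Python 2.7 script and os.walk() in Python 3.7, which has recently been updated to use the scandir algorithm as well (no additional stat() calls).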

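However, the Python 3 version of the script is still significantly slower! This is not caused by the scandir/walk part of the algorithm. It seems to come either from the getmtime calls (which are the same calls in Python 2 and 3) or from processing the huge list (we are talking about roughly 500,000 entries in this list).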

Any idea what might cause this and how to fix it?


    #!/usr/bin/env python3
    #
    # imports
    #
    import sys
    import time
    from datetime import datetime
    import os
    import re

    #
    # MAIN THREAD
    #
    if __name__ == '__main__':
        source_dir = '/path_to_data/'
        # Not defined in the posted snippet; an empty placeholder so the script runs
        ignore_subdirs = []

        # Get file list
        files_list = []
        for root, directories, filenames in os.walk(source_dir):
            # Filter for extension
            for filename in filenames:
                if (filename.lower().endswith(('.msg', '.doc', '.docx', '.xls', '.xlsx'))) and (not filename.lower().startswith('~')):
                    files_list.append(os.path.join(root, filename))

        # Sort list
        files_list.sort(reverse=True)

        # For each file, the printing routine is performed (including necessity check)
        all_documents_counter = len(files_list)
        for docfile_abs in files_list:
            print('\n' + docfile_abs)

            # Define files
            filepathname_abs, file_extension = os.path.splitext(docfile_abs)
            filepath_abs, filename = os.path.split(filepathname_abs)

            # If the filename does not have the format #######*.xxx (i.e. seven digits),
            # it is checked whether it is referenced in the database.
            # If not, it is moved to a certain directory
            if (re.match(r'[0-9][0-9][0-9][0-9][0-9][0-9][0-9](([Aa][0-9][0-9]?)?|(_[0-9][0-9]?)?|([Aa][0-9][0-9]?_[0-9][0-9]?)?)\...?.?', filename + file_extension) is None):
                if any(expression in docfile_abs for expression in ignore_subdirs):
                    pass
                else:
                    print('Not in database')

            # DOC
            docfile_rel = docfile_abs.replace(source_dir, '')

            # Check pdf (lowercase extension)
            try:
                pdf_file_abs = filepathname_abs + '.pdf'
                pdf_file_timestamp = os.path.getmtime(pdf_file_abs)
                check_pdf = True
            except FileNotFoundError:
                check_pdf = False

            # Check PDF (uppercase extension)
            try:
                PDF_file_abs = filepathname_abs + '.PDF'
                PDF_file_timestamp = os.path.getmtime(PDF_file_abs)
                check_PDF = True
            except FileNotFoundError:
                check_PDF = False

            # Check whether there is a lowercase or an uppercase extension and
            # decide what to do if there are none, just one, or both present
            if (check_pdf is True) and (check_PDF is False):
                # Lowercase case
                pdf_extension = '.pdf'
                pdffile_timestamp = pdf_file_timestamp
            elif (check_pdf is False) and (check_PDF is True):
                # Uppercase case
                pdf_extension = '.PDF'
                pdffile_timestamp = PDF_file_timestamp
            elif (check_pdf is False) and (check_PDF is False):
                # None -> set timestamp to zero
                pdf_extension = '.pdf'
                pdffile_timestamp = 0
            elif (check_pdf is True) and (check_PDF is True):
                # Both are present: decide for the newest and move the other to a directory
                if (pdf_file_timestamp < PDF_file_timestamp):
                    pdf_extension = '.PDF'
                    pdf_file_rel = pdf_file_abs.replace(source_dir, '')
                    pdffile_timestamp = PDF_file_timestamp
                else:
                    # Also covers equal timestamps, which the posted code left unhandled
                    pdf_extension = '.pdf'
                    PDF_file_rel = PDF_file_abs.replace(source_dir, '')
                    pdffile_timestamp = pdf_file_timestamp

            # Get timestamp of the doc file
            try:
                docfile_timestamp = os.path.getmtime(docfile_abs)
            except OSError:
                docfile_timestamp = 0

            # Enable this to force a certain period to be printed
            DateBegin = time.mktime(time.strptime('01/02/2017', "%d/%m/%Y"))
            DateEnd = time.mktime(time.strptime('01/03/2017', "%d/%m/%Y"))

            # Compare timestamps and print or not
            if (pdffile_timestamp < docfile_timestamp) or (pdffile_timestamp == 0):
                # Inform that the pdf should be printed
                print('\tpdf should be printed.')
            else:
                # Inform that there was no need to print
                print('\tpdf is up to date.')

        # Exit
        sys.exit(0)

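Best answer: I am not sure what explains the difference, but even though os.walk has been enhanced to use scandir, that improvement does not extend to the later getmtime calls, which access the file attributes a second time.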

The ultimate goal is to not call os.path.getmtime at all.

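The speedup in os.walk comes from not performing stat twice just to find out whether an object is a directory or a file. But the internal DirEntry objects (yielded by scandir) are never exposed, so you cannot reuse them to check the file time.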

If you don't need recursion, this can be done with os.scandir:


    import os

    for dir_entry in os.scandir(r"D:\some_path"):
        print(dir_entry.is_dir())  # test for directory
        print(dir_entry.stat())    # returns stat object with date and all

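On Windows, those calls inside the loop are essentially free, because the DirEntry object already caches this information from the directory listing. On POSIX systems is_dir() usually needs no extra system call, while the first stat() on an entry costs one system call, after which the result is cached on the entry.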

So, to save the getmtime calls, you have to obtain the DirEntry objects recursively.

There is no native method for that, but there are examples available, for instance: How do I use os.scandir() to return DirEntry objects recursively on a directory tree?
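A minimal sketch of such a recursive helper, assuming Python 3.6 or later (on Python 2.7 the third-party scandir package offers scandir.scandir, and the `yield from` would need to become an explicit loop). The names `scantree` and `collect_mtimes` are illustrative and not part of the standard library:

```python
import os


def scantree(path):
    """Recursively yield a DirEntry object for every non-directory below path."""
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                # Descend into real subdirectories, but do not follow symlinks
                yield from scantree(entry.path)
            else:
                yield entry


def collect_mtimes(root):
    """Map each file path to its mtime, taken from DirEntry.stat()
    instead of a separate os.path.getmtime() call."""
    return {entry.path: entry.stat().st_mtime for entry in scantree(root)}
```

Calling `collect_mtimes('/path_to_data/')` would then return a dictionary of path to modification time for the whole tree.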

That way your code will be faster in both Python 2 and Python 3, because there is only 1 stat call per object instead of 2.

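Edit: now that you have edited the question to show the code, it seems that you are building the pdf names from other entries, so you cannot rely on the DirEntry structure to get the time, or even to know whether the file exists. (Also, if you are on Windows, file names are case-insensitive, so there is no need to test for both pdf and PDF.)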

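The best strategy would be to build one big database of files with their associated times and other attributes (using a dictionary), and then scan that. I have successfully used this method to find old or big files on a slow network drive holding 35 million files. In my own case, the files were scanned once and the result was dumped into a big csv file (this took a few hours and produced 6 GB of csv data). Further post-processing then loaded the database and performed various tasks, which was much faster because no disk access was involved.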
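The following is a rough sketch of that strategy under some assumptions. The helper names (`build_index`, `save_index`, `load_index`, `documents_to_print`) and the two-column csv layout are invented for illustration and are not the format used in the anecdote above. The extension list is taken from the question.

```python
import csv
import os

DOC_EXTENSIONS = ('.msg', '.doc', '.docx', '.xls', '.xlsx')


def build_index(root):
    """Scan the tree once and return {file path: mtime}."""
    index = {}
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    index[entry.path] = entry.stat().st_mtime
    return index


def save_index(index, csv_path):
    """Dump the index as rows of (path, mtime)."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(index.items())


def load_index(csv_path):
    """Reload an index written by save_index()."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        return {path: float(mtime) for path, mtime in csv.reader(f)}


def documents_to_print(index):
    """Return documents whose pdf is missing or older, using only the index."""
    # Newest pdf per base name, whatever the case of the extension
    pdf_mtime = {}
    for path, mtime in index.items():
        base, ext = os.path.splitext(path)
        if ext.lower() == '.pdf':
            pdf_mtime[base] = max(mtime, pdf_mtime.get(base, mtime))
    stale = []
    for path, mtime in index.items():
        base, ext = os.path.splitext(path)
        name = os.path.basename(path)
        if ext.lower() in DOC_EXTENSIONS and not name.startswith('~'):
            pdf = pdf_mtime.get(base)
            if pdf is None or pdf < mtime:
                stale.append(path)
    return sorted(stale, reverse=True)
```

After the single scan in `build_index`, the comparison in `documents_to_print` is made up of dictionary lookups only, so it touches the disk no further. Saving the index to csv lets later runs skip the scan, at the cost of working on possibly outdated timestamps.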
