
Here is what I have so far:

    #!/usr/bin/env python
    import sys
    from collections import defaultdict
    import itertools

    inp = sys.argv[1]  # input fasta file; format '>header'\n'sequence'

    with open(inp, 'r') as f:
        h = []
        s = []
        for line in f:
            if line.startswith(">"):
                h.append(line.strip().split('>')[1])  # append headers to list
            else:
                s.append(line.strip())  # append sequences to list

    seqs = dict(zip(h, s))  # create dictionary of headers:sequence
    print('Total Sequences: ' + str(len(seqs)))  # Numb. total sequences in input file

    groups = defaultdict(list)
    for i in seqs:
        # Create defaultdict with sequences in lists with identical headers
        groups['_'.join(i.split('_')[1:])].append(seqs[i])

    def hamming(str1, str2):
        """ Simple hamming distance calculator """
        # Note: falls through and returns None when the lengths differ
        if len(str1) == len(str2):
            diffs = 0
            for ch1, ch2 in zip(str1, str2):
                if ch1 != ch2:
                    diffs += 1
            return diffs

    keys = [x for x in groups]
    combos = list(itertools.combinations(keys, 2))  # Create tupled list with all comparison combinations
    combined = defaultdict(list)  # Defaultdict in which to place groups

    for i in combos:  # Combo = (A1_B1_STRING1, A2_B2_STRING2)
        a1 = i[0].split('_')[0]
        a2 = i[1].split('_')[0]
        b1 = i[0].split('_')[1]  # Get A's, B's, C's
        b2 = i[1].split('_')[1]
        c1 = i[0].split('_')[2]
        c2 = i[1].split('_')[2]
        if a1 == a2 and b1 == b2:  # If A1 is equal to A2 and B1 is equal to B2
            d = hamming(c1, c2)  # Get distance of STRING1 vs STRING2
            if d <= 2:  # If distance is less than or equal to 2
                combined[i[0]].append(groups[i[0]] + groups[i[1]])  # Add to defaultdict by combo 1 key

    print(len(combined))
    for c in sorted(combined):
        print('%s\t%d' % (c, len(combined[c])))

The problem is that this code does not work as expected. When I print the keys of the combined defaultdict, I can clearly see many that could still be combined. However, the length of the combined defaultdict is about half the original size.
Edit

An alternative without itertools.combinations:

    for a in keys:
        tocombine = []
        tocombine.append(a)
        tocheck = [x for x in keys if x != a]
        for b in tocheck:
            i = (a, b)  # Combo = (A1_B1_STRING1, A2_B2_STRING2)
            a1 = i[0].split('_')[0]
            a2 = i[1].split('_')[0]
            b1 = i[0].split('_')[1]  # Get A's, B's, C's
            b2 = i[1].split('_')[1]
            c1 = i[0].split('_')[2]
            c2 = i[1].split('_')[2]
            if a1 == a2 and b1 == b2:  # If A1 is equal to A2 and B1 is equal to B2
                if len(c1) == len(c2):  # If length of STRING1 is equal to STRING2
                    d = hamming(c1, c2)  # Get distance of STRING1 vs STRING2
                    if d <= 2:
                        tocombine.append(b)
        for n in range(len(tocombine[1:])):
            keys.remove(tocombine[n])
            combined[tocombine[0]].append(groups[tocombine[n]])

    final = defaultdict(list)
    for i in combined:
        final[i] = list(itertools.chain.from_iterable(combined[i]))

However, with this approach I am still missing some keys that never get matched with the others.
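Solution

I think I see one problem with your code. Consider this scenario:

    0: A_B_DATA1
    1: A_B_DATA2
    2: A_B_DATA3

    All the valid comparisons are:
    0 -> 1 * Combines under key 'A_B_DATA1'
    0 -> 2 * Combines under key 'A_B_DATA1'
    1 -> 2 * Combines under key 'A_B_DATA2' **oops

I think you would want all three of these combined under one key. But consider the following case:

    0: A_B_DATA111
    1: A_B_DATA122
    2: A_B_DATA223

    All the valid comparisons are:
    0 -> 1 * Combines under key 'A_B_DATA111'
    0 -> 2 * Falls under key 'A_B_DATA111', but the distance is 3, so it is not combined
    1 -> 2 * Combines under key 'A_B_DATA122'

Now it gets a bit trickier: row 0 is at distance 2 from row 1, and row 1 is at distance 2 from row 2, but you probably do not want all of them grouped together, because row 0 is at distance 3 from row 2!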
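To make the caveat concrete, here is a minimal sketch (Python 3) that computes the pairwise distances for the second example and then merges keys through any chain of close pairs. The function `chain_groups` and its `max_dist` parameter are hypothetical names introduced for illustration; they are not part of the original code. This is one possible way to merge, not necessarily the grouping you want.

```python
from itertools import combinations

def hamming(s1, s2):
    """Number of differing positions; requires equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

keys = ['A_B_DATA111', 'A_B_DATA122', 'A_B_DATA223']

# Pairwise distances between the STRING parts of the keys
distances = {
    (k1, k2): hamming(k1.split('_')[2], k2.split('_')[2])
    for k1, k2 in combinations(keys, 2)
}

def chain_groups(keys, distances, max_dist=2):
    """Hypothetical helper: merge keys linked by any chain of close pairs."""
    parent = {k: k for k in keys}

    def find(k):
        # Follow parent links up to the representative of k's group
        while parent[k] != k:
            k = parent[k]
        return k

    for (k1, k2), d in distances.items():
        if d <= max_dist:
            parent[find(k2)] = find(k1)

    groups = {}
    for k in keys:
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())

for pair, d in distances.items():
    print(pair, d)
print(chain_groups(keys, distances))
```

With chained merging, all three keys end up in a single group even though the first and last are at distance 3 from each other, which is exactly the situation described above.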
Here is an example of a working solution, assuming this is what you want the output to look like:

    def unpack_key(key):
        # Split 'A_B_STRING' into the 'A_B' prefix and the STRING part
        data = key.split('_')
        return '_'.join(data[:2]), '_'.join(data[2:])

    combined = defaultdict(list)
    for key1 in groups:
        combined[key1] = []
        key1_ab, key1_string = unpack_key(key1)
        for key2 in groups:
            if key1 != key2:
                key2_ab, key2_string = unpack_key(key2)
                if key1_ab == key2_ab and len(key1_string) == len(key2_string):
                    if hamming(key1_string, key2_string) <= 2:
                        combined[key1].append(key2)

For our second example, this results in the dictionary below. If this is not the answer you are looking for, could you post what the final dictionary for that example should be?

    A_B_DATA111: ['A_B_DATA122']
    A_B_DATA122: ['A_B_DATA111', 'A_B_DATA223']
    A_B_DATA223: ['A_B_DATA122']

Keep in mind that this is an O(n^2) algorithm, which means it will not scale well as your set of keys grows.
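One way to reduce the number of comparisons, sketched here as an assumption rather than as part of the answer above, is to bucket the keys by their 'A_B' prefix and STRING length first, and only compare keys inside the same bucket. The function name `close_keys` and the `max_dist` parameter are hypothetical. The worst case is still quadratic when every key lands in one bucket.

```python
from collections import defaultdict
from itertools import combinations

def hamming(s1, s2):
    """Number of differing positions; callers pass equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def unpack_key(key):
    # Split 'A_B_STRING' into the 'A_B' prefix and the STRING part
    data = key.split('_')
    return '_'.join(data[:2]), '_'.join(data[2:])

def close_keys(keys, max_dist=2):
    """Hypothetical helper: map each key to the keys within max_dist of it."""
    # Only keys sharing a prefix and a STRING length can ever match,
    # so group them up front and skip all cross-bucket comparisons.
    buckets = defaultdict(list)
    for key in keys:
        prefix, string = unpack_key(key)
        buckets[(prefix, len(string))].append((key, string))

    combined = {key: [] for key in keys}
    for members in buckets.values():
        for (k1, s1), (k2, s2) in combinations(members, 2):
            if hamming(s1, s2) <= max_dist:
                # Record the match in both directions
                combined[k1].append(k2)
                combined[k2].append(k1)
    return combined

example = ['A_B_DATA111', 'A_B_DATA122', 'A_B_DATA223', 'A_C_DATA111']
result = close_keys(example)
for key in sorted(result):
    print(key, result[key])
```

Each pair inside a bucket is compared once instead of twice, and keys with a different prefix (such as the made-up 'A_C_DATA111' here) are never compared against the 'A_B' keys at all.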