python - Comparison between one element and all the others of a DataFrame column
I have a list of tuples that I turned into a DataFrame with thousands of rows, like this:
                      frag         mass  prot_position
    0       tfdehnapnsnsnk  1573.675712              2
    1        epganaigmvafk  1303.659458             29
    2                 gtik   417.258734              2
    3             spwpsmar   930.438172             44
    4                 lpak   427.279469             29
    5  nedsfvvweqiinslsalk  2191.116099             17
    ...
And I have the following rule:
    def are_dif(m1, m2, ppm=10):
        # True when the relative difference is at least `ppm` parts per million
        if abs((m1 - m2) / m1) < ppm * 0.000001:
            v = False
        else:
            v = True
        return v
So, I only want the "frag"s whose mass differs from the mass of every other fragment. How can I achieve this "selection"?
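To make the requirement concrete, here is a minimal brute-force sketch that applies the `are_dif` rule to every pair of masses with NumPy broadcasting. The helper name `unique_mass_mask` is hypothetical (not from the original post), and the approach builds an n-by-n array, so it is only meant as a reference for a few thousand rows.

```python
import numpy as np
import pandas as pd


def unique_mass_mask(masses, ppm=10):
    """Return a boolean array: True where a mass differs from ALL other masses.

    Hypothetical helper; uses the same rule as are_dif(m1, m2, ppm).
    """
    m = np.asarray(masses, dtype=float)
    # rel[i, j] = |m[i] - m[j]| / m[i], i.e. are_dif's expression for every pair
    rel = np.abs(m[:, None] - m[None, :]) / m[:, None]
    differs = rel >= ppm * 0.000001
    # An element should not be compared with itself
    np.fill_diagonal(differs, True)
    return differs.all(axis=1)


# The six sample rows shown in the question
df = pd.DataFrame({
    "frag": ["tfdehnapnsnsnk", "epganaigmvafk", "gtik",
             "spwpsmar", "lpak", "nedsfvvweqiinslsalk"],
    "mass": [1573.675712, 1303.659458, 417.258734,
             930.438172, 427.279469, 2191.116099],
    "prot_position": [2, 29, 2, 44, 29, 17],
})

mask = unique_mass_mask(df["mass"])
selected = df[mask]
```

In this sample all six masses are far apart, so every row is kept; rows would only be dropped when two masses fall within 10 ppm of each other.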
Then, I have a list named "pinfo" in which each element contains:
    d = {'id': id, 'seq': seq_code, "1hw_fit": hits_fit}  # one for each protein
    # each dictionary is at the position of the protein it describes
So, I want to add 1 to the "hits_fit" value in the dictionary of the respective protein.
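One possible reading of this second requirement, sketched under assumptions: `prot_position` is taken to be the index of the protein's dictionary inside `pinfo`, and the counter lives under the `"1hw_fit"` key. The function name `count_hits` and the placeholder `pinfo` contents are hypothetical.

```python
# Placeholder pinfo: one dictionary per protein, stored at that protein's position.
# The real ids and sequences are not shown in the post, so dummy values are used.
pinfo = [{"id": i, "seq": "", "1hw_fit": 0} for i in range(50)]


def count_hits(pinfo, prot_positions):
    """Add 1 to "1hw_fit" for every selected fragment's protein (assumed indexing)."""
    for pos in prot_positions:
        pinfo[pos]["1hw_fit"] += 1


# prot_position values of the fragments that survived the mass selection,
# e.g. selected["prot_position"] from a filtered DataFrame
count_hits(pinfo, [29, 17])
```

If `prot_position` is not a list index but an identifier, a dictionary keyed by that identifier would be the safer lookup.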
If I'm understanding correctly (not sure if I am), you can accomplish quite a bit by sorting. First though, let me adjust the data to have a mix of close and far values for mass:
       Unnamed: 0                 frag         mass  prot_position
    0           0       tfdehnapnsnsnk  1573.675712              2
    1           1        epganaigmvafk  1573.675700             29
    2           2                 gtik   417.258734              2
    3           3             spwpsmar   417.258700             44
    4           4                 lpak   427.279469             29
    5           5  nedsfvvweqiinslsalk  2191.116099             17
Then I think you can do the following to select the "good" ones. First, create 'pdiff' (percent difference) to see how close each mass is to its nearest neighbor after sorting:
    ppm = .00001
    # older pandas spelled this df.sort('mass'); sort_values is the current name
    df = df.sort_values('mass')
    df['pdiff'] = (df.mass - df.mass.shift()) / df.mass

       Unnamed: 0                 frag         mass  prot_position         pdiff
    3           3             spwpsmar   417.258700             44           NaN
    2           2                 gtik   417.258734              2  8.148421e-08
    4           4                 lpak   427.279469             29  2.345241e-02
    1           1        epganaigmvafk  1573.675700             29  7.284831e-01
    0           0       tfdehnapnsnsnk  1573.675712              2  7.625459e-09
    5           5  nedsfvvweqiinslsalk  2191.116099             17  2.817926e-01
The first and last data lines make this a little tricky, so the next line backfills the first line and repeats the last line so that the following mask works correctly. This works for the example here, but it might need to be tweaked for other cases (though only as far as the first and last lines of data are concerned).
    # list(...) is needed on Python 3, where range() is not a list
    df = df.iloc[list(range(len(df))) + [-1]].bfill()
    df[(df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm)]
Results:
       Unnamed: 0                 frag         mass  prot_position     pdiff
    4           4                 lpak   427.279469             29  0.023452
    5           5  nedsfvvweqiinslsalk  2191.116099             17  0.281793
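For reference, the same sorting idea can be written as one self-contained function. This is a sketch rather than the exact code above: it compares each mass with both its previous and next neighbor in sorted order and treats the missing neighbor at either end as infinitely far away, which avoids repeating the last row. The name `select_unique_masses` is hypothetical, and the threshold follows `are_dif` from the question (relative to the row's own mass, in ppm).

```python
import numpy as np
import pandas as pd


def select_unique_masses(df, ppm=10):
    """Keep rows whose mass is at least `ppm` parts per million away
    from both neighbors in sorted order (hypothetical helper)."""
    tol = ppm * 0.000001
    s = df.sort_values("mass")
    mass = s["mass"]
    # Relative gap to the previous / next row; the ends have no neighbor -> inf
    prev_gap = (mass.diff().abs() / mass).fillna(np.inf)
    next_gap = (mass.diff(-1).abs() / mass).fillna(np.inf)
    return s[(prev_gap >= tol) & (next_gap >= tol)]


# The adjusted example data with a mix of close and far masses
df = pd.DataFrame({
    "frag": ["tfdehnapnsnsnk", "epganaigmvafk", "gtik",
             "spwpsmar", "lpak", "nedsfvvweqiinslsalk"],
    "mass": [1573.675712, 1573.675700, 417.258734,
             417.258700, 427.279469, 2191.116099],
    "prot_position": [2, 29, 2, 44, 29, 17],
})

unique = select_unique_masses(df)
print(unique)
```

Checking only the two sorted neighbors is enough because any other mass is at least as far away in absolute terms, and the denominator (the row's own mass) stays the same.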
Sorry, I don't understand the second part of the question at all.
Edit to add: As mentioned in a comment on @amitavory's answer, I think the sorting approach and the groupby approach could possibly be combined into a simpler answer than this. I might try that at a later time, but everyone should feel free to give it a shot if interested.