python - Count the frequency of words in a pandas DataFrame

Tags: python, pandas, nltk

Top 3 answers

Votes: 100

Split each name into words, stack them into a single Series, and count the values:

    In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
    Out[3361]:
    Society       3
    Ltd           2
    James's       1
    R.X.          1
    Yah           1
    Associates    1
    St            1
    Kensington    1
    MMV           1
    Big           1
    &             1
    The           1
    Co            1
    Oil           1
    Building      1
    dtype: int64

Or, with NumPy:

    pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()

Or, by joining all names into one string first:

    pd.Series(' '.join(df.Firm_Name).split()).value_counts()

For the top N words, for example the top 3:

    In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
    Out[3379]:
    Society    3
    Ltd        2
    James's    1
    dtype: int64

Details:

    In [3380]: df
    Out[3380]:
          URN                   Firm_Name
    0  104472               R.X. Yah & Co
    1  104873        Big Building Society
    2  109986          St James's Society
    3  114058  The Kensington Society Ltd
    4  113438      MMV Oil Associates Ltd
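The one-liners above assume that `pd` (and `np`) are already imported and that `df` already exists. As a self-contained sketch, the snippet below rebuilds the sample frame shown above and checks that the join-and-split approach agrees with an `explode`-based variant. The `explode` variant is not part of the answer; it is offered here as an assumed equivalent for pandas 0.25 and later.

```python
import pandas as pd

# Sample data copied from the DataFrame shown above.
df = pd.DataFrame(
    {
        "URN": [104472, 104873, 109986, 114058, 113438],
        "Firm_Name": [
            "R.X. Yah & Co",
            "Big Building Society",
            "St James's Society",
            "The Kensington Society Ltd",
            "MMV Oil Associates Ltd",
        ],
    }
)

# Approach from the answer: join all names, split on whitespace, count.
counts_join = pd.Series(" ".join(df["Firm_Name"]).split()).value_counts()

# Assumed alternative: one row per word via explode, then count.
counts_explode = df["Firm_Name"].str.split().explode().value_counts()

# Both approaches should yield the same word -> count mapping.
assert counts_join.to_dict() == counts_explode.to_dict()

# Show the three most frequent words.
print(counts_join.head(3))
```

Both variants split on whitespace only, so punctuation stays attached to the words ("R.X.", "James's") and "&" is counted as a word.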
Votes: 88

Tokenize with NLTK and count with FreqDist, lowercasing everything first:

    import nltk
    import pandas as pd

    top_N = 4
    # Lowercasing is optional; drop .str.lower() to keep the original case
    a = data['Firm_Name'].str.lower().str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(a)
    word_dist = nltk.FreqDist(words)
    print(word_dist)
    <FreqDist with 17 samples and 20 outcomes>

    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
    print(rslt)
          Word  Frequency
    0  society          3
    1      ltd          2
    2      the          1
    3       co          1

Without lowercasing, the original case is kept:

    top_N = 4
    a = data['Firm_Name'].str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(a)
    word_dist = nltk.FreqDist(words)
    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
    print(rslt)
             Word  Frequency
    0     Society          3
    1         Ltd          2
    2         MMV          1
    3  Kensington          1
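If NLTK or its tokenizer data is not installed, a rough stand-in can be sketched with the standard library alone. This is not the answer's method: `re.findall(r"\w+", ...)` is a much cruder tokenizer than `word_tokenize` (it drops punctuation and splits "R.X." and "James's" into separate pieces), and the names `firm_names`, `top_n`, and `word_counts` are illustrative assumptions.

```python
import re
from collections import Counter

# Firm names copied from the question's sample data.
firm_names = [
    "R.X. Yah & Co",
    "Big Building Society",
    "St James's Society",
    "The Kensington Society Ltd",
    "MMV Oil Associates Ltd",
]

top_n = 4

# Lowercase and join into one string, mirroring the answer's preprocessing.
text = " ".join(firm_names).lower()

# Crude stand-in for nltk.tokenize.word_tokenize:
# keep runs of letters, digits, or underscores and discard everything else.
words = re.findall(r"\w+", text)

# Counter plays the role of nltk.FreqDist here.
word_counts = Counter(words)
top_words = word_counts.most_common(top_n)
print(top_words)
```

The two most frequent words match the NLTK result (`society`, then `ltd`); the words that occur only once may come out in a different order or be split differently.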
Votes: 79

Update a collections.Counter with the words of each row:

    from collections import Counter
    import pandas as pd

    c = Counter()
    df = pd.DataFrame(
        [[104472, "R.X. Yah & Co"],
         [104873, "Big Building Society"],
         [109986, "St James's Society"],
         [114058, "The Kensington Society Ltd"],
         [113438, "MMV Oil Associates Ltd"]],
        columns=["URN", "Firm_Name"])
    # c.update is called once per row with that row's list of words
    df.Firm_Name.str.split().apply(c.update)

The counter c then holds:

    Counter({'R.X.': 1,
             'Yah': 1,
             '&': 1,
             'Co': 1,
             'Big': 1,
             'Building': 1,
             'Society': 3,
             'St': 1,
             "James's": 1,
             'The': 1,
             'Kensington': 1,
             'Ltd': 2,
             'MMV': 1,
             'Oil': 1,
             'Associates': 1})
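Note that `apply(c.update)` is used only for its side effect and returns a Series of `None` values. A possible variant, not from the answer, builds the Counter directly from a generator expression and then turns the most common words into a small DataFrame; the `top` name and the choice of two rows are assumptions for illustration.

```python
from collections import Counter

import pandas as pd

# Same sample data as in the answer.
df = pd.DataFrame(
    [
        [104472, "R.X. Yah & Co"],
        [104873, "Big Building Society"],
        [109986, "St James's Society"],
        [114058, "The Kensington Society Ltd"],
        [113438, "MMV Oil Associates Ltd"],
    ],
    columns=["URN", "Firm_Name"],
)

# Build the Counter from every whitespace-separated word in every name,
# without relying on apply() for its side effect.
c = Counter(word for name in df["Firm_Name"] for word in name.split())

# Optional: present the two most common words as a DataFrame.
top = pd.DataFrame(c.most_common(2), columns=["Word", "Frequency"])
print(top)
```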
