Topic modeling is a process for automatically identifying the topics present in a text object and deriving the hidden patterns exhibited by a text corpus. It assists in better decision making.
LDA's approach to topic modeling is to treat each document as a collection of topics in a certain proportion, and each topic as a collection of keywords, again in a certain proportion.
Once we provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics until it obtains a good composition of the topic-keyword distribution.
#Display the block diagram of the pipeline
from IPython.display import Image
Image(filename = r"C:\Users\Inspiron 15\Desktop\images\Block Diagram.png", width = 700, height = 300)
pafy 0.5.4 is used to download the audio.
Each line of the provided file contains the URL of a YouTube video; the leading indices (1, 2, 3, ..., 50) are stripped from each line to obtain the bare URL.
The best available audio stream is downloaded to the appropriate location.
The next cell contains the code.
#Code for extracting audio from a YouTube video
import pafy
fileName = r"...\read.txt" # file from which URLs are fetched
with open(fileName, 'r') as f:
    for line in f:
        # Strip the leading index, e.g. "3. https://www.youtube.com/watch?v=0Dc-FPlOsn0"
        # becomes "https://www.youtube.com/watch?v=0Dc-FPlOsn0"
        url = line.lstrip('0123456789. ')
        print(url)
        video = pafy.new(url)
        bestaudio = video.getbestaudio()
        bestaudio.download(r"...\store\\") # path where audio files are stored
        print("success")
The file of URLs (read.txt) is available at:
https://github.com/Duttabhi/TopicModelling/blob/master/read.txt
The following link points to a sample audio file fetched by the code above; it is the audio of the first YouTube video listed in the file. The remaining audio files are not uploaded, as uploading them requires good connectivity.
The audio file is in .webm format and is therefore converted to .wav (both the .webm and the .wav file are uploaded).
https://drive.google.com/drive/u/0/folders/1p9u9MopqoMU-YRukxLfXYImwt4zHyUCj
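The .webm to .wav conversion and the 30-second chunking used later are not shown in the code; the following is a minimal sketch using pydub (which requires ffmpeg), with illustrative file names, producing the chunked files that the transcription step below reads.
# Sketch: convert the downloaded .webm audio to .wav and split it into
# 30-second chunks (file names are illustrative; requires pydub + ffmpeg)
from pydub import AudioSegment
audio = AudioSegment.from_file("sample.webm")
audio.export("sample.wav", format="wav") # .webm -> .wav
chunk_ms = 30 * 1000 # 30 seconds, matching the i * 30 offsets used below
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk = audio[start:start + chunk_ms]
    chunk.export("chunks/chunk{:03d}.wav".format(i), format="wav")
The chunked .wav files are then transcribed with the Google Cloud Speech API through the speech_recognition package: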
import os
import speech_recognition as sr
from tqdm import tqdm
with open(r"path_to_json_file\textToSpeech.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()
r = sr.Recognizer()
files = sorted(os.listdir(r'path_to_all_the_chunked_audio_files')) #sorting the chunked files
all_text = []
for f in tqdm(files):
    name = r"path_to_all_the_chunked_audio_files" + f # f is the name of the chunk being fed in
    # Load audio file
    with sr.AudioFile(name) as source:
        audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    all_text.append(text) #appending the transcriptions one after another
transcript = ""
for i, t in enumerate(all_text):
    total_seconds = i * 30 # each chunk is 30 seconds long
    # Turn the chunk offset into an h:m:s timestamp and prepend it to the text
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)
    transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t)
#print(transcript)
with open(r"file_name_in_which_transcript_will_be_stored", "w") as f: #storing the transcription in the appropriate file
    f.write(transcript)
A sample transcription produced by this step:
https://github.com/Duttabhi/TopicModelling/blob/master/SampleTranscription.txt
import spacy
spacy.load('en')
from spacy.lang.en import English
parser = English()
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue #skip whitespace tokens
        elif token.like_url:
            lda_tokens.append('URL') #replace URLs with a placeholder token
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME') #replace @mentions with a placeholder token
        else:
            lda_tokens.append(token.lower_) #lowercase everything else
    return lda_tokens
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4] #keep only tokens longer than 4 characters
    tokens = [token for token in tokens if token not in en_stop] #remove English stopwords
    tokens = [get_lemma(token) for token in tokens] #lemmatize
    return tokens
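As a quick illustration (the sentence is arbitrary), the full preprocessing chain behaves like this:
# Illustrative usage of prepare_text_for_lda on an arbitrary sentence
print(prepare_text_for_lda("Richard is a senior UX designer at Google"))
# Stopwords and tokens of four characters or fewer are dropped, the rest are
# lowercased and lemmatized: ['richard', 'senior', 'designer', 'google']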
import script
import random
text_data = []
with open(r'C:\Users\Inspiron 15\Desktop\transcription\t.csv') as f: #file in which the data set is present
    for line in f:
        #Each line of the data set is passed to prepare_text_for_lda, defined above.
        tokens = script.prepare_text_for_lda(line)
        if random.random() > .99:
            print(tokens) #print a random ~1% sample of the tokenized lines for inspection
        text_data.append(tokens)
#Using gensim for topic modelling: preparing the inputs for the LDA topic model.
from gensim import corpora
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data] #bag-of-words representation of each document
import pickle
pickle.dump(corpus, open('corpus.pkl', 'wb')) #saving the corpus for future use
dictionary.save('dictionary.gensim') #saving the dictionary for future use
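Because both objects are saved, a later session can reload them without recomputing; a minimal sketch using pickle and gensim's load methods:
# Sketch: reloading the saved dictionary and corpus in a later session
import pickle
from gensim import corpora
dictionary = corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)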
import gensim
NUM_TOPICS = 10 #extract 10 topics from the data set
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4) #show the top 4 keywords of each topic
for topic in topics:
    print(topic)
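The trained model can likewise be reloaded in a later session (a sketch; the file name follows the save above):
# Sketch: reloading the saved LDA model
ldamodel = gensim.models.ldamodel.LdaModel.load('model5.gensim')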
new_doc = '''hi I'm here with Richard Fulcher
Richard is a UX designer at Google
senior UX designer that's right and has
been working on UX for the last 15 years
thank you for joining us Richard it's a
pleasure to be here today thanks for
having me to get started can you
describe what user experience design is
sure it's it's a big umbrella of several
related disciplines that are all focused
on the design of experiences to help a
individual person you know achieve some
kind of goal and that covers a lot of
different types of experiences most
commonly we think about software design
but it's also things like environment
design and maybe the design of physical
products it could even be things like
event coordination it's kind of anything
that a user experiences that can be
constructed for them when I was kind of
first taught this there was kind of five
key concepts it's the study of users and
their context can the environment they
operate in in order that we can design
tools for them to achieve tasks that let
them complete goals so users context
tasks and goals goals okay so let's
start with the user who are they well
the user is anyone who's going to use
the product or the service that you're
designing and it's really important to
remember that that runs a really wide
range you know it's very easy to kind of
when you're designing just think about
designing for yourself or this like one
super idealized user that you have in
mind
but in reality you're gonna be building
this for people with very different
levels of experience with different
contexts and different backgrounds that
affect how they'll perceive this thing
that you build for them all right and
how do I know who the user is
well the key for that is something
called user research I think it's kind
of most associated with this idea of
taking this product you're working on
and and testing it in some way you know
having a user come in and try to perform
a series of tasks
that you assign them and see how they do
and see the things that are problems for
them but user research really covers the
whole end-to-end development of the
product even before you've drawn your
first screen you can engage with user
research to go into your potential users
homes or workplaces to start to
understand you know that context that's
important to the way that they behave
and what their goals are and once you've
even kind of finished and shipped you
can do things like look at logs and try
to analyze usage patterns see what
people are doing with your product
perhaps kind of bias towards things that
you want them to do more of or see areas
that they're spending a lot of time in
that you haven't anticipated
''' #transcript whose topic is to be fetched
new_doc = script.prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))
#The call returns the topic distribution with probabilities; the topic with the highest probability is selected as the best fit.
#Display the image of the resulting topic labels
from IPython.display import Image
Image(filename = r"C:\Users\Inspiron 15\Desktop\images\labels.png", width = 700, height = 300)