Identification and extraction of memes represented as semantic networks from free text online forums

Memes have recently come into vogue in the context of ‘viral’ transmission of basic information units in online social networks. However, beyond their original general definition in a sociological context, there is still much work to be done from an information technology viewpoint. This includes issues such as how to process memes from real text corpora, formal definitions for knowledge representation, and meme refinement and selection. In order to address these issues, in this paper we adapt definitions from the semantic network and information retrieval fields to extract memes as semantic networks from free text documents.


INTRODUCTION
As defined in a sociological context by Dawkins [1] and Blackmore [2], a meme is understood as a basic element of useful knowledge, or metainformation, which can be transmitted from one individual to another. However, from an information technology point of view, many technical and implementation challenges remain, such as how to identify and extract key memes from free text document corpora. The study of memes has a high potential utility for understanding and modelling information diffusion and influence in Online Social Networks, and for applications such as recommender systems [3]. Hence the work is motivated by the technical challenges on the one hand, and by the potential application of the results on the other.
In this paper the main focus will be on the problem of defining and identifying semantic network type structures in free text, which can then be used to represent memes. We use information retrieval and semantic network concepts to identify and extract the key memes from a larger candidate set. A simple example of an online forum of comment posts is given to illustrate how the framework could be applied in practice.
The structure of the remainder of the paper is as follows: in the second section we present the state of the art and related work; in the third section we present the definitions for the documents, semantic network concepts (entities and relations) and memes; in the fourth section we give some examples for the definitions of section three; in the fifth section we consider meme metrics and how we can use them to identify the 'top' memes extracted using the definitions and examples of sections three and four; in the final section we give the conclusions.

STATE OF THE ART AND RELATED WORK
The term "meme" was originally defined by Dawkins in [1], and has recently been applied to the study of how information spreads through the Internet and Online Social Networks (OSNs).
According to Dawkins [1] a "meme" or a "memetype" is similar to a "gene" or "virus". It consists of a basic unit of information circulating among a community, and research from a social sciences perspective has studied how it serves as a mechanism to propagate cultural and social evolution [4]. In [4], Heylighen and Chielens compare the 'meme' with the 'gene' and formalize the following meme properties: 'longevity', the duration that an individual meme survives; 'fecundity', the reproductive activity of a meme; 'copy-fidelity', the degree to which a meme is accurately reproduced.
In [5], Bordogna and Pasi propose a schematic definition for memes using an OWL schema, followed by the definition of several operators to extract memes from online blog posts using information retrieval methods and n-grams (contiguous sequences of n items from a given sequence of text). A fuzzy-type matching is performed to evaluate the fidelity of a given blog post to an original meme description. Finally, the longevity is considered by ordering the text entries by their timestamp and taking into consideration the fidelity.
Leskovec, Backstrom and Kleinberg in [6] developed a framework for tracking short textual memes in an online news media environment, identifying a broad class of memes that exhibit a wide spread and rich variation on a daily basis. Simmons, Adamic and Adar in [7] presented a study of meme mutation in social networks. They uncovered patterns in the rate of appearance of new variants, their length and popularity, and developed a simple model that is able to represent these attributes. Nettleton in [8] presents a wide-ranging survey of OSN analysis, covering themes such as 'influence and recommendation' and 'information diffusion', which includes contextual entity tracking using memes. Baydin and López de Mántaras [9] present an evolutionary algorithm based on the concept of memes. They used semantic networks to represent the individual pieces of information, and employed the 'genetic' concepts of crossover and mutation to model changes over time. Their method was tested on synthetically generated examples.
Now we will briefly summarize some of the literature with respect to the extraction of semantic networks from text. Szumlanski and Gomez in [10] extracted semantic networks based on frequency and concept affinity from Wikipedia texts, using the WordNet [11] ontology database to identify related concepts. In [12], Jiang and Conrath describe a semantic similarity metric based on corpus statistics and a lexical taxonomy. They present an approach for measuring semantic similarity/distance between words and concepts which uses a distributional analysis of the corpus data. In [13], Chen, Gangopadhyay, Karabatis, McGuire and Welty deal with the elicitation of semantic networks based on concepts relevant to the data mining of specific datasets. In [14], Kok and Domingos present an unsupervised approach to extracting semantic networks from large volumes of text. They use the TextRunner system [15] to extract tuples from text, and then induce general concepts and relations from them by jointly clustering the objects and relational strings in the tuples. Their approach is defined in Markov logic using four basic rules to extract meaningful semantic networks.

Introduction
In the following we will define the two main data processing steps (Processes 1 and 2) and their corresponding definitions (1 to 6).
Process 1: This process acts on the complete document set D to identify the key concepts and relations. It comprises Definitions 1 to 3. The objective of Definition 1 is to identify the most relevant subset of documents and key concepts from the complete document corpus. Then, Definitions 2 and 3 identify the relationships between the key concepts. We note that Definitions 1 to 3 act on the complete document corpus D.
Process 2: This process acts on individual documents to compact the semantic networks (eliminate redundant relations and identify the minimal semantic networks). It comprises Definitions 4, 5 and 6, which deal with eliminating redundant relations and finding the minimal semantic networks between concepts. We observe that Definitions 4, 5 and 6 act on individual documents d.
We note the importance of the use of thresholds in the processing. The thresholds are determined statistically from the probability distributions of the corresponding metrics. A threshold can be defined by a quartile limit or by a point of inflexion of the corresponding distribution.
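As an illustration of the quartile option mentioned above, the following minimal sketch derives a threshold from the distribution of a metric. The function name `quartile_threshold`, the choice of the upper quartile and the sample values are assumptions made for the example; the paper does not prescribe a particular procedure.

```python
import statistics


def quartile_threshold(values, which=3):
    """Return the chosen quartile cut point (1, 2 or 3) of `values`.

    Assumption: the 'inclusive' method is used, which treats `values`
    as the whole population rather than a sample.
    """
    # With n=4, statistics.quantiles returns the three quartile cut points.
    cut_points = statistics.quantiles(values, n=4, method="inclusive")
    return cut_points[which - 1]


# Hypothetical metric values for nine candidates.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
threshold = quartile_threshold(scores)
selected = [s for s in scores if s >= threshold]
```

Detecting a point of inflexion would need a different procedure (for example, inspecting second differences of the sorted values) and is not covered by this sketch.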

Definitions which define the extraction of semantic networks as memes
Definition 1. A concept is an n-gram (excluding stopwords, in the information retrieval sense) that is present in a significant number of documents in a document collection. Formally, let $D$ be the total document collection and let $D(x_i)$ be the set of documents in $D$ that contain the n-gram $x_i$. Then, an n-gram $x_i$ is a concept when it satisfies the condition:
$$\frac{|D(x_i)|}{|D|} \geq \alpha$$
where $\alpha \in [0,1]$ is a user-defined threshold. The threshold $\alpha$ indicates the percentage of documents that must contain an n-gram for it to be considered a concept. How is this value chosen? Low values of $\alpha$ will yield many concepts; on the other hand, higher values of $\alpha$ will yield fewer concepts. As we consider a document collection which is a free text comments forum, we are interested in those concepts that have the most presence in the discussion. As an initial approximation, we could choose a moderately high value for $\alpha$, of the order of 0.70 ± 0.05. Empirically, we could consider the three highest deciles in a frequency distribution table of candidates for concepts. Other definitions can be found in [16][17].
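The document-frequency reading of the threshold α in Definition 1 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the stopword list, the whitespace tokenizer, the limit of bigrams (`max_n=2`) and the four sample comments are all assumptions. Stopwords are removed before n-grams are formed, which is a simplification.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a"}


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_concepts(documents, alpha, max_n=2):
    """Return the n-grams present in at least a fraction alpha of documents."""
    doc_freq = Counter()
    for text in documents:
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        # Count each n-gram at most once per document.
        doc_freq.update(seen)
    total = len(documents)
    return {gram for gram, count in doc_freq.items() if count / total >= alpha}


# Hypothetical forum comments.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the fox hunts the dog",
    "a dog is lazy",
    "the fox is quick",
]
concepts = extract_concepts(docs, alpha=0.7)
```

With these sample comments only 'fox' and 'dog' appear in three of the four documents, so they are the only n-grams that pass α = 0.7; lowering α to 0.5 also admits 'quick', 'lazy' and 'is'.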
In a given free text block written by a user of an online community, some concepts will be related to each other in a way that has meaning for that community. Concepts such as "democracy", "is" and "participation" have little meaning when each of them is taken in isolation; however, if they are related by means of a verbal expression (which may be another concept), then they acquire much more meaning. For example, "participation is democracy".
In this context we must determine which concepts are co-occurrences and which are related.
We recall that two concepts are a co-occurrence if they are at a distance of less than n words apart, in the same sentence, or in the same paragraph. In the first case, a limit of 4 can be placed on the value of n; an interesting study on the co-occurrence of words is to be found in Ferrer-i-Cancho and Solé's paper [18]. The following Definition 2 provides a way to determine related concepts.
Definition 2. Relationship related (R): Let $x_i$ and $x_j$ be concepts in a document collection $D$, and let $|D(x_i, x_j)|$ represent the number of documents in $D$ in which $x_i$ and $x_j$ are found as co-occurrences.
Thus, Definition 2 will be true if both concepts appear together in many documents in the collection $D$. In this case, the threshold $\gamma \in [0,1]$ indicates a measure to determine when both concepts are considered related, and is assigned by the user and verified empirically. Again, high values of $\gamma$ could provide few related concepts and low values of $\gamma$ could provide many related concepts. By extracting the concepts that correspond to verbs, nouns and adjectives, a syntactic and semantic representation of text can be obtained in the form of semantic networks.
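A minimal sketch of the co-occurrence test and of the 'related' relationship is given below. Two points are assumptions rather than statements from the text: the co-occurrence count is normalized by the total number of documents before being compared with γ, and concepts are single words matched after whitespace tokenization. The function names and sample comments are hypothetical.

```python
def cooccur(tokens, a, b, window=4):
    """True if a and b occur fewer than `window` words apart in `tokens`."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) < window for i in pos_a for j in pos_b)


def cooccurrence_ratio(documents, a, b, window=4):
    """Fraction of documents in which a and b are co-occurrences.

    Assumption: the count |D(a, b)| is divided by |D|.
    """
    hits = sum(cooccur(doc.lower().split(), a, b, window) for doc in documents)
    return hits / len(documents)


def related(documents, a, b, gamma, window=4):
    """True if the co-occurrence ratio reaches the threshold gamma."""
    return cooccurrence_ratio(documents, a, b, window) >= gamma


# Hypothetical forum comments.
docs = [
    "the fox hunts the dog",
    "the fox jumps over the lazy dog",
    "the dog is lazy",
    "the quick fox",
]
ratio = cooccurrence_ratio(docs, "fox", "dog")
```

In this sample, 'fox' and 'dog' are within the window only in the first comment, so the ratio is 0.25 and the pair counts as related only for γ ≤ 0.25.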
A semantic network (SN) is a notation that allows us to represent ideas with meaning and which represent knowledge. An SN is represented by a graph in which the nodes are concepts (nouns and adjectives) and the arcs are the relationships (verbal expressions) between them [19]. Figure 1 shows a semantic network with 9 concepts and 3 distinct relationships. This form of notation is used, for example, in the fields of natural language processing and information retrieval [20], among others.
Definition 3. [...] is a superset of $D(x_i)$ and we denote $D(x_j) \rightarrow D(x_i)$. In this case, $\delta \in (0,1)$ can be calculated empirically, where $|D(x)|$ is the cardinality of set $D(x)$.
From Definitions 1, 2 and 3 we have now identified the concepts as the most important terms in a document set. If we consider that $x_i$ is a candidate concept and $c_i$ is a chosen concept whose frequency in the document set is above the given thresholds, then $x_i \rightarrow c_i$ when the given thresholds $\alpha$, $\gamma$ and $\delta$ are satisfied. In the following Definitions 4 to 6 we will consider a given document $d_q$ belonging to document set $D$.
Definition 4. Relationship redundant. A relationship $d_q(x_i) \rightarrow d_q(x_j)$ is redundant if there exist one or more concepts such that $d_q(x_i) \rightarrow d_q(x_k) \rightarrow \dots \rightarrow d_q(x_j)$ in a semantic network.
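Definition 4 can be read as a reachability test: a direct relationship is redundant when its target can still be reached from its source through at least one intermediate concept. The sketch below illustrates that reading under the assumption that the network is a directed acyclic graph given as a list of (source, target) pairs; the edge directions in the fox, dog and wolf example are assumed for illustration.

```python
def is_redundant(edges, source, target):
    """True if `target` is reachable from `source` without the direct edge."""
    graph = {}
    for u, v in edges:
        if (u, v) != (source, target):  # ignore the edge under test
            graph.setdefault(u, set()).add(v)
    stack, visited = [source], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in visited:
            continue
        visited.add(node)
        stack.extend(graph.get(node, ()))
    return False


# Assumed link directions: fox -> dog -> wolf, plus a direct fox -> wolf link.
edges = [("fox", "dog"), ("dog", "wolf"), ("fox", "wolf")]
non_redundant = [e for e in edges if not is_redundant(edges, *e)]
```

Removing every redundant edge at once, as in the last line, is only safe for acyclic networks; with cycles, the edges would need to be removed one at a time and re-tested.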
Definition 5. Closest Superset. Let $C_x = \{d_q(c_1), d_q(c_2), \dots, d_q(c_k)\}$ be the set of supersets of $x$, i.e. $C_x$ is the set of all $d_q(c_i)$ such that $d_q(x) \rightarrow d_q(c_1)$, $d_q(x) \rightarrow d_q(c_2)$, …, $d_q(x) \rightarrow d_q(c_k)$. The Closest Superset of $x$ is the smallest of all $d_q(c_i)$.
Definition 6. Minimal Semantic Network. Let graph $G = (C, R)$ be a semantic network where $C$ is a set of concepts and $R$ is the set of relationships between concepts. A semantic network $G$ is a minimal semantic network if, for all relationships $r_k = (c_i, c_j, \mathrm{Type}) \in R$, $d_q(c_i)$ is the closest superset of $d_q(c_j)$, where 'Type' is the set of possible relations.
With respect to Definition 6, we note that each relationship in a semantic network can be expressed by a triple $(x_i, x_j, \mathrm{type})$ where $x_i$ and $x_j$ are concepts and "type" is the type of relationship between $x_i$ and $x_j$. In [17] Oh, Kin, Park and Yu proved that in a minimal semantic network the relationships between concepts are not redundant.

EXAMPLES
In this Section, with reference to Figures 2, 3 and 4, we will give an example of each of the aspects we have described in Section 3. We note that the objective of future work will be to automate the process as much as possible; however, we envisage a semi-automatic scheme which may require some manual annotation of the original text and semi-supervised processing in other steps, such as in [18]. Although these implementation details are out of the scope of the current paper, we can say that in order to construct the semantic networks we would need to distinguish between the entities and the relations in the initial set of concepts. This could be done using natural language processing software tools and a relationship-instance repository together with WordNet [11], http://wordnet.princeton.edu/, in order to identify entities (e.g. nouns, adjectives) and relations (e.g. verbs, adverbs).
Firstly, in Figure 2 we see a simplified example of a typical online comments forum for a newspaper article.That is, a newspaper publishes an article about a given theme and below the article the registered users are allowed to post their opinions.What typically happens is that users with differing opinions create a debate in which some users state their opinions and other users either support or reject all or part of those opinions.
We observe in Figure 2 that user 1 has posted a comment, to which user 2 replies. Then user 3 posts a new comment, to which user 1 replies, and user 2 in turn replies to that comment. We can clearly see that the central concepts are about foxes and dogs.
Concepts: these correspond to the search terms, which can be entities and/or relations. In Figure 3, the semantic networks formed include the entity concepts 'fox', 'dog', 'brown', 'quick', 'lazy', and the relation concepts 'is', 'jumps-over', 'hunts'. As mentioned, semi-automatic tools exist for identifying syntax structures; however, we must not underestimate the difficulty of correctly identifying the relations between entities, especially when a concept has different meanings depending on the context. For example, the concept 'quick' can be a noun, adjective or adverb. For the present work, we assume a manual revision of the ambiguous cases. In Table 1 we see the concepts, their syntactic classification and the corresponding assignment as entity or relation.
Documents: a document is a block of text (comment) written by a user. In information retrieval, if we formulate a query to search for a set of terms (or concepts), such as {fox, dog}, the query will return a set of documents in which one or more (depending on whether the query is AND or OR) of the query terms appear. Hence, a document will contain one or more concepts which can be formed into one or more semantic networks. In Figures 2 and 3 we see there are five documents, designated as $d_1$ to $d_5$.
Semantic network (candidate meme): a semantic network is made up of two or more entity concepts which are related by one or more relation concepts.
A document may contain one or more semantic networks, made up of corresponding concepts.
In Figure 3 we see that we have extracted three significant memes from all the potential semantic networks which can be constructed from the respective texts. Later, in the section 'Incorporation of the Meme Metrics', we will consider how we can use the meme metrics to identify the most significant memes.
Superset-subset: If a set of documents $Sd_1$ is included within another set of documents $Sd_2$, then $Sd_1$ is a subset of $Sd_2$ and $Sd_2$ is a superset of $Sd_1$. This is related to the information retrieval concept of document retrieval sets corresponding to queries made of one or more query terms. In the current context the query terms would be the concepts making up the memes, that is, each meme is a potential query.
Redundancy: a relation (link) between two concepts is redundant if it already exists via another path. With reference to Figure 4a, we see that the link between 'fox' and 'wolf' is redundant because it is already implicit (inherited) through the links between 'fox', 'dog' and 'wolf'.
Closest superset: the smallest superset with respect to a given subset. Returning to the example of Figure 3, consider three queries: those we defined previously, $Sd_1$ and $Sd_2$, and a new one $Sd_3$ = {fox, lazy, dog} which returns documents $\{d_1, d_3\}$. Hence, the smallest superset with respect to $Sd_3$ will be $Sd_1$, as opposed to $Sd_2$, given that $Sd_2$ contains four documents whereas $Sd_1$ contains only three.
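The closest-superset example can be checked with a few lines of set arithmetic. The sketch below uses the document sets stated in the text for the three queries; the function name `closest_superset` and the dictionary layout are hypothetical.

```python
def closest_superset(target, candidates):
    """Return the name of the smallest candidate set that contains `target`.

    `candidates` maps a query name to its set of retrieved documents.
    Returns None when no candidate is a superset of `target`.
    """
    supersets = {name: docs for name, docs in candidates.items() if target <= docs}
    if not supersets:
        return None
    return min(supersets, key=lambda name: len(supersets[name]))


# Retrieval sets as given in the example.
sd1 = {"d1", "d2", "d3"}        # query {fox, dog, jumps}
sd2 = {"d1", "d2", "d3", "d4"}  # query {fox, dog}
sd3 = {"d1", "d3"}              # query {fox, lazy, dog}

closest = closest_superset(sd3, {"Sd1": sd1, "Sd2": sd2})
```

Both Sd1 and Sd2 contain Sd3, and Sd1 is chosen because it holds three documents rather than four, which agrees with the example.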
In Table 2 we see the queries and the corresponding document sets.
Compact: within a document, all groups of concepts (memes) are connected together by common concepts. With reference to Figure 4b, we see that one unique semantic network has been formed by a (weak) link between two memes (concept groups with strong links).
Minimal: each group of concepts (meme) is separated from any other group of concepts. All links (relation concepts) are designated as being strong. With reference to Figure 4b, we see that two distinct memes (concept groups) are identified.
Meme: a semantic network which is composed of entity concepts with strong links (relation concepts), equivalent to the definition of 'minimal' (above). However, we apply further processing to identify the most relevant memes in a document collection, using the metrics that we will see in the next section.

Incorporation of the Meme Metrics
In this Section we will first describe how we can use the three meme metrics of longevity, fecundity and copy-fidelity to select the 'top' memes. Then we will give an example of processing using the memes depicted in Figures 2 and 3. We note that we perform the meme metric based selection once the minimal memes have been obtained through the semantic network extraction process described previously.

Meme Metric Based Selection Process
In order to perform the meme metric based selection, we represent the users as a directed graph, through which the memes are considered to 'move'.The implementation details of the graph and associated data structures are out of the scope of the present paper.
The selection process is performed in four steps: (i) obtain a value for each of the three metrics for each meme; (ii) obtain the distribution of the values of each metric for all memes; (iii) establish a cut-off point (threshold) for each metric based on their distributions; (iv) identify the memes which are above the thresholds for all metrics.
Step 1: obtain a value for each of the three metrics for each meme.
Note 1, graph structure: the representation of the users and the meme transit between the users will require the implementation of the appropriate data structures and data processing procedures.
Step 2: obtain the ordered distribution of the values of each metric for all memes.
2.1. The distribution of the longevity values $m_L$ for all memes will be a vector $d_L$. The distribution of the fecundity values $m_F$ for all memes will be a vector $d_F$. The distribution of the copy-fidelity values $m_I$ for all memes will be a vector $d_I$.
Step 3: establish a cut-off point (threshold) for each metric based on their distributions.
3.1. The threshold for the longevity distribution $d_L$ will be designated as $\lambda$. The threshold for the fecundity distribution $d_F$ will be designated as $\varphi$. The threshold for the copy-fidelity distribution $d_I$ will be designated as $\sigma$.
Note 2, thresholds: there are different statistical techniques we can use to assign the thresholds $\lambda$, $\varphi$ and $\sigma$ based on the numerical distribution. For example, we can identify an inflexion point, or we can use the top x% percentile, or use a supervised optimization technique. This process could be manual, automatic or semi-automatic.
Step 4: identify the memes which are above the thresholds for all metrics. The output of function MT will be a binary value in {0, 1}, for which 1 signifies that meme m is within all three thresholds and 0 signifies that it is not. We note that we could relax the meme threshold restrictions, to require only two, or just one threshold to be complied with.
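Step 4 can be sketched as a small thresholding function. This is one possible reading of MT and not the paper's formula: it assumes that 'above the threshold' means greater than or equal, that copy-fidelity is expressed as a score where higher is better, and that the relaxed variant is controlled by a hypothetical `required` parameter. The metric values and thresholds are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemeMetrics:
    longevity: float      # m_L
    fecundity: float      # m_F
    copy_fidelity: float  # m_I, assumed here to be 'higher is better'


def mt(m, lam, phi, sigma, required=3):
    """Return 1 if at least `required` metrics reach their thresholds, else 0."""
    passed = (
        int(m.longevity >= lam)
        + int(m.fecundity >= phi)
        + int(m.copy_fidelity >= sigma)
    )
    return 1 if passed >= required else 0


# Hypothetical metric values and thresholds (not taken from the paper).
memes = {
    "m1": MemeMetrics(longevity=5, fecundity=4, copy_fidelity=0.9),
    "m2": MemeMetrics(longevity=2, fecundity=4, copy_fidelity=0.8),
    "m3": MemeMetrics(longevity=5, fecundity=1, copy_fidelity=0.4),
}
top = [name for name, m in memes.items() if mt(m, lam=3, phi=2, sigma=0.7)]
```

Setting `required=2` reproduces the relaxed selection in which only two of the three thresholds need to be met.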

Example of meme metric based selection
In this Section we will give a simple example of the meme threshold based selection, with reference to the memes $m_1$, $m_2$ and $m_3$ shown in Figures 2 and 3. We note that, in this example, time is measured as the number of arcs traversed, and not as the difference between the timestamps.
Hence, $m_1$ is the only meme which is above all three thresholds and is therefore selected as the top meme based on the metric thresholds.

CONCLUSIONS
In this paper we have given some formal definitions for memes, in terms of information retrieval and semantic network concepts. We have given some examples which illustrate how these definitions can be used to identify, extract and process memes from an online forum. We have then used the meme metrics to select the memes in terms of importance and quality, for the given document set. This work lays the ground for future work in which we will process large real online forums containing free text documents (comments), and further develop the formal definitions of memes and their behaviour in different scenarios.

Figure 2. Online forum example: user's posts, with date and timestamps.

Figure 4. (a) Example of a redundant relation in a semantic network; (b) Example of a compact and a minimal semantic network.

Table 1. Concepts, syntactic categories and assignments as entity or relation.

With reference to Figure 3, consider the following superset-subset example: the query {fox, dog, jumps} retrieves the set of documents $Sd_1 = \{d_1, d_2, d_3\}$; the query {fox, dog} retrieves the set of documents $Sd_2 = \{d_1, d_2, d_3, d_4\}$, which is a superset of document set $Sd_1$. Likewise, $Sd_1$ is a subset of $Sd_2$.

Table 2. Queries and document sets.
Copy-fidelity I for a given meme $m$ is designated as $m_I$. $m_I$ is equal to the degree of 'loss of fidelity' of a meme over a given time period $t$ or for a given number of arcs traversed. Implementation: a similarity comparison function, with an appropriate distance metric, will be applied to evaluate the fidelity of a given meme (at time $t$) with respect to the original meme (at time 0).
Consider a meme $m$ whose characteristics $mc$ are embodied as: concept entities $\{e_1, \dots, e_n\}$, concept relations $\{r_1, \dots, r_m\}$, longevity $m_L$, fecundity $m_F$ and copy-fidelity $m_I$.
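The meme characteristics and the similarity comparison can be sketched as a small data structure with a set-based similarity. The paper leaves the distance metric open; Jaccard similarity over the union of entity and relation concepts is an assumption chosen here for simplicity, and the class and function names are hypothetical. Longevity and fecundity are omitted from the class because they depend on the user graph rather than on the meme's content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meme:
    entities: frozenset   # concept entities {e_1, ..., e_n}
    relations: frozenset  # concept relations {r_1, ..., r_m}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def copy_fidelity(original, current):
    """Similarity of a meme at time t to the original meme at time 0.

    Assumption: 1.0 means an exact copy; the 'loss of fidelity'
    would then be 1 minus this value.
    """
    return jaccard(
        original.entities | original.relations,
        current.entities | current.relations,
    )


# Hypothetical meme and a later variant that has lost one entity.
original = Meme(frozenset({"fox", "dog", "lazy"}), frozenset({"jumps-over"}))
variant = Meme(frozenset({"fox", "dog"}), frozenset({"jumps-over"}))
fidelity = copy_fidelity(original, variant)
```

Comparing sets of concepts ignores which entities each relation connects; comparing sets of (entity, entity, type) triples would be a stricter alternative.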