Biblio
By representing large corpora with concise and meaningful elements, topic-based generative models aim to reduce the dimension and understand the content of documents. Those techniques originally analyze on words in the documents, but their extensions currently accommodate meta-data such as authorship information, which has been proved useful for textual modeling. The importance of learning authorship is to extract author interests and assign authors to anonymous texts. Author-Topic (AT) model, an unsupervised learning technique, successfully exploits authorship information to model both documents and author interests using topic representations. However, the AT model simplifies that each author has equal contribution on multiple-author documents. To overcome this limitation, we assumes that authors give different degrees of contributions on a document by using a Dirichlet distribution. This automatically transforms the unsupervised AT model to Supervised Author-Topic (SAT) model, which brings a novelty of authorship prediction on anonymous texts. The SAT model outperforms the AT model for identifying authors of documents written by either single authors or multiple authors with a better Receiver Operating Characteristic (ROC) curve and a significantly higher Area Under Curve (AUC). The SAT model not only achieves competitive performance to state-of-the-art techniques e.g. Random forests but also maintains the characteristics of the unsupervised models for information discovery i.e. Word distributions of topics, author interests, and author contributions.
Complex event processing has become an important technology for big data and intelligent computing because it facilitates the creation of actionable, situational knowledge from potentially large amount events in soft realtime. Complex event processing can be instrumental for many mission-critical applications, such as business intelligence, algorithmic stock trading, and intrusion detection. Hence, the servers that carry out complex event processing must be made trustworthy. In this paper, we present a threat analysis on complex event processing systems and describe a set of mechanisms that can be used to control various threats. By exploiting the application semantics for typical event processing operations, we are able to design lightweight mechanisms that incur minimum runtime overhead appropriate for soft realtime computing.
With the rapid increase in cloud services collecting and using user data to offer personalized experiences, ensuring that these services comply with their privacy policies has become a business imperative for building user trust. However, most compliance efforts in industry today rely on manual review processes and audits designed to safeguard user data, and therefore are resource intensive and lack coverage. In this paper, we present our experience building and operating a system to automate privacy policy compliance checking in Bing. Central to the design of the system are (a) Legal ease-a language that allows specification of privacy policies that impose restrictions on how user data is handled, and (b) Grok-a data inventory for Map-Reduce-like big data systems that tracks how user data flows among programs. Grok maps code-level schema elements to data types in Legal ease, in essence, annotating existing programs with information flow types with minimal human input. Compliance checking is thus reduced to information flow analysis of Big Data systems. The system, bootstrapped by a small team, checks compliance daily of millions of lines of ever-changing source code written by several thousand developers.